On Fri, 23 Mar 2001, Dan Sugalski wrote:

> At 11:41 PM 3/22/2001 +0000, Nicholas Clark wrote:
> >On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> 
> hadn't thought of. If we do, then something as simple as this:
> 
>    while (<IN>) {
>          $count++ if /bar/;
>          print OUT $_;
>    }
> 
> would potentially result in the output file being rather different from the 
> input file. Equivalent, yes, but different. Whether that's bad or not is an 
> open question.

        I'll just toss my 0.01 cents in...my thought here is that this
thread has now tied up a lot of cycles from a lot of very smart, very
experienced people without resulting in an answer that is clearly The
Right Thing.  Whatever we do, there is a problem at some point...if we do
normalizations internally for some functions, then you end up with a
situation like the code above, which looks like it should produce
identical input and output files, but won't necessarily.  OTOH, if we
don't do normalizations, then (e.g.) length() can return different values
for different representations of the same string.
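
To make the length() point concrete, here is a minimal sketch. It assumes a
Unicode::Normalize-style module that exports NFC() and NFD(); nothing in
this thread depends on that particular module, it is just a convenient way
to show the two representations side by side.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same user-visible character, "e with acute accent", written two ways.
my $composed   = "\x{E9}";      # one precomposed code point
my $decomposed = "e\x{301}";    # "e" followed by a combining acute accent

# Without normalization the two strings differ in length and compare unequal.
my $len_composed   = length($composed);
my $len_decomposed = length($decomposed);
my $raw_equal      = $composed eq $decomposed ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $nfc_equal = NFC($decomposed) eq NFC($composed) ? 1 : 0;
my $nfd_equal = NFD($decomposed) eq NFD($composed) ? 1 : 0;

print "composed length:   $len_composed\n";
print "decomposed length: $len_decomposed\n";
print "equal as-is:       $raw_equal\n";
print "equal after NFC:   $nfc_equal\n";
print "equal after NFD:   $nfd_equal\n";
```

This is also why the loop quoted above can change a file: if reading or
matching normalizes $_ behind the scenes, the string printed to OUT need not
be the same code points that came in from IN.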
        
        My suggestion is, let's punt on this one...make it the
programmer's responsibility to ensure that Unicode strings are represented
in the desired way.  To be honest, I suspect that most people will find it
easier to store one piece of information in their heads ("I'm dealing with
Unicode, so I need to handle representation") rather than several ("Things
that might change my encoding: pattern match, ord, length, ..., wurzle
(but only if I've frobbed the blarg), ...").  We then just provide a
function like the following:

        represent( $foo, 'UTF-8C')
        represent( $foo, 'UTF-8D')
        represent( $foo, 'UTF-16C')
        represent( $foo, 'UTF-16D') ...etc

Maybe a couple of extra switches on regexen as well, to force the matched
string (and any captured substrings) to be in a particular encoding.
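
For what it's worth, here is a rough sketch of what a represent() like the
one above might look like. Everything in it is an assumption for
illustration: I'm reading the trailing C/D as "composed" (NFC) and
"decomposed" (NFD), returning encoded bytes rather than a character string,
picking big-endian UTF-16 with no BOM, and leaning on the Encode and
Unicode::Normalize modules. The %FORMS table is hypothetical too.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical mapping from each label to a normalization function and
# an encoding name understood by Encode.
my %FORMS = (
    'UTF-8C'  => [ \&NFC, 'UTF-8'    ],
    'UTF-8D'  => [ \&NFD, 'UTF-8'    ],
    'UTF-16C' => [ \&NFC, 'UTF-16BE' ],   # byte order is an assumption
    'UTF-16D' => [ \&NFD, 'UTF-16BE' ],
);

# Normalize $string to the requested form, then encode it to bytes.
sub represent {
    my ($string, $label) = @_;
    my $form = $FORMS{$label}
        or die "Unknown representation '$label'\n";
    my ($normalize, $encoding) = @$form;
    return encode($encoding, $normalize->($string));
}

# Usage: the same logical string comes out with a different byte count
# depending on the representation asked for.
my $foo = "e\x{301}";
for my $label (sort keys %FORMS) {
    printf "%-8s %d bytes\n", $label, length(represent($foo, $label));
}
```

The point of doing it this way is that the conversion only happens where the
programmer asks for it, so nothing else (pattern matches, length(), I/O) has
to change a string's representation implicitly.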

                        Dave Storrs
