Re: UTF-8 case conversion

Jarkko Hietaniemi Wed, 03 Sep 2003 05:55:59 -0700

> > use Encode 'from_to';
> > 
> > my $orjan = 'ÖRJAN';
> > my $lundstrom = 'LUNDSTRÖM';
> > 
> > print $orjan . ' ' . $lundstrom . "\n";
> > 
> > from_to $orjan,'latin1','utf-8';
> > from_to  $lundstrom,'latin1','utf-8';
> 
> It is my understanding that from_to is the wrong thing to use here. The


Your understanding is correct.

> - you obtain some character data, for example by putting it literally in
>   your script. If the script itself is in utf-8, it should contain
>   "use utf8;". If not (like your script), perl will assume ISO-8859-1.

Or "use encoding 'whatever';", and Perl actually assumes whatever is
your native encoding, be it ISO 8859-1, or -2, or CP1252, or EBCDIC,
or whatever.

>   A different source of data would be reading from a file, which is
>   opened with the correct encoding specified (see Andreas' reply).
> 
>   A third source would be by reading a file or a socket and obtaining raw
>   bytes which can be interpreted as characters using decode().

In this case, e.g.:

$lundstrom = decode("latin-1", $lundstrom);
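
A slightly fuller sketch of the same idea, tying it back to the case
conversion in the subject line. The hard-coded byte string and the
variable names are only assumptions for illustration; in practice the
bytes would come from your file or socket. "iso-8859-1" names the same
encoding as "latin-1".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw ISO 8859-1 bytes as they might arrive from a file or socket.
# 0xD6 is the Latin-1 code for O with diaeresis.
my $bytes = "LUNDSTR\xD6M";

# decode() turns the bytes into a Perl character string.
my $lundstrom = decode("iso-8859-1", $bytes);

# lc() now works on characters, so U+00D6 becomes U+00F6.
my $lower = lc $lundstrom;

# Show the code point of the eighth character (index 7).
printf "U+%04X\n", ord(substr($lower, 7, 1));
```

This should print U+00F6, i.e. the lowercase letter.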

> - Manipulate the data using perl string operations
> 
> - Output the data to a filehandle which is opened using the correct
>   encoding.
> 
> The from_to function looks enticing, particularly because everyone has
> heard about perl and utf8 strings, when it's almost always the wrong
> thing to use. And perl does not use utf8, but supports unicode character
> semantics.
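
The three steps quoted above (obtain characters, manipulate, output
through an encoding-aware filehandle) could be sketched roughly like
this. It is only an illustration: the input bytes are hard-coded, and an
in-memory filehandle stands in for a real file or STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Step 1: obtain characters. The raw bytes here are ISO 8859-1,
# with 0xD6 standing for O with diaeresis.
my $name = decode("iso-8859-1", "\xD6RJAN LUNDSTR\xD6M");

# Step 2: manipulate with ordinary Perl string operations.
# Lowercase each word, then uppercase its first character.
my $cased = join ' ', map { ucfirst(lc($_)) } split / /, $name;

# Step 3: output through a handle opened with an explicit encoding.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "open failed: $!";
print {$fh} $cased, "\n";
close $fh or die "close failed: $!";
```

After this, $out holds the UTF-8 encoded bytes, while $cased is still a
character string. Note that from_to is not needed anywhere.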

At least in the current Encode doc there is a section:

B<CAVEAT>: The following operations look the same but are not quite so;

  from_to($data, "iso-8859-1", "utf8"); #1
  $data = decode("iso-8859-1", $data);  #2

Both #1 and #2 make $data consist of a completely valid UTF-8 string
but only #2 turns utf8 flag on.  #1 is equivalent to

  $data = encode("utf8", decode("iso-8859-1", $data));

See L</"The UTF-8 flag"> below.
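
To make the difference visible, here is a small sketch comparing the two
(the variable names are made up for the example). It also shows why the
from_to route breaks case conversion: lc() then sees the two UTF-8 bytes
of the letter rather than one character.

```perl
use strict;
use warnings;
use Encode qw(decode from_to);

my $octets = "\xD6RJAN";    # ISO 8859-1 bytes, 0xD6 is O with diaeresis

# 1: from_to() converts in place; the result is still a byte string.
my $converted = $octets;
from_to($converted, "iso-8859-1", "utf8");

# 2: decode() returns a character string with the UTF-8 flag on.
my $decoded = decode("iso-8859-1", $octets);

printf "from_to: length %d, flag %s\n",
    length($converted), Encode::is_utf8($converted) ? "on" : "off";
printf "decode:  length %d, flag %s\n",
    length($decoded), Encode::is_utf8($decoded) ? "on" : "off";

# Only the decoded string gets its first letter lowercased.
my $lc_converted = lc $converted;
my $lc_decoded   = lc $decoded;
```

The from_to result should report length 6 with the flag off, and the
decode result length 5 with the flag on. With default settings (no
locale, no unicode_strings feature) lc() leaves the two bytes 0xC3 0x96
alone in the first case, and gives U+00F6 in the second.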

> -- 
> Bart.

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen
