Re: UTF-8 case conversion

sigfrid . lundberg Wed, 03 Sep 2003 06:00:15 -0700

On Wed, 3 Sep 2003, Bart Schuller wrote:

> On Wed, Sep 03, 2003 at 01:05:21PM +0200, [EMAIL PROTECTED] wrote:
> > use Encode 'from_to';
> >
> > my $orjan = 'ÖRJAN';
> > my $lundstrom = 'LUNDSTRÖM';
> >
> > print $orjan . ' ' . $lundstrom . "\n";
> >
> > from_to $orjan,'latin1','utf-8';
> > from_to  $lundstrom,'latin1','utf-8';
>
> It is my understanding that from_to is the wrong thing to use here. The
> variables $orjan and $lundstrom contain perl strings containing perl
> characters with unicode semantics.


I think I'm starting to understand... Terry Jones pointed out the problem
with from_to. I removed those calls, added the use utf8; pragma, and
converted my program's source to utf-8 using the recode program. Then it
started to work as expected.

> from_to is used to convert bytes in one encoding into bytes in another
> encoding. Neither before nor after this operation are these bytes
> characters as far as perl is concerned. So you should not use perl-level
> operations like uc or lc or regexes or substr on them.
>
> The way it all is supposed to work is:
>
> - you obtain some character data, for example by putting it literally in
>   your script. If the script itself is in utf-8, it should contain
>   "use utf8;". If not (like your script), perl will assume ISO-8859-1.

Exactly, this is my working case above.
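
For the archive, a minimal sketch of that working case. It assumes the script file itself is saved as UTF-8; the variable $name and the ucfirst/lc combination are only my illustration of a case conversion, not part of the original test program.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Encode characters to UTF-8 bytes on the way out.
binmode STDOUT, ':encoding(UTF-8)';

my $orjan     = 'ÖRJAN';
my $lundstrom = 'LUNDSTRÖM';

# With "use utf8" the literals are character strings, so lc and ucfirst
# operate on whole characters rather than on the bytes of an encoding.
my $name = join ' ', map { ucfirst(lc($_)) } ($orjan, $lundstrom);

print "$name\n";
```

No from_to call is needed anywhere: the pragma handles the input side and the layer on STDOUT handles the output side.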

>   A different source of data would be reading from a file, which is
>   opened with the correct encoding specified (see Andreas' reply).

I tested that. I created a file of fake personal names, first
in iso-8859-1, and just read it. That didn't work well. Then I converted
it to utf-8, and with

binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";

it worked.
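
As far as I understand it, the iso-8859-1 file can also be read directly if its encoding is declared when the file is opened. A small sketch of that idea follows; the temporary file and its single name are made up for the example, and I have not run this against my real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test data: one name stored as ISO-8859-1 bytes (0xD6 is Ö).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xD6RJAN\n";
close $tmp or die "close: $!";

# Declare the file's encoding on open, so perl decodes bytes to characters.
open my $in, '<:encoding(iso-8859-1)', $path or die "open $path: $!";
chomp(my $name = <$in>);
close $in;

# Character operations now behave as expected.
my $fixed = ucfirst lc $name;

# Output goes through a layer that encodes the characters as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";
```

The same layer syntax should work for a utf-8 file by writing '<:encoding(UTF-8)' instead.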

>   A third source would be reading a file or a socket and obtaining raw
>   bytes which can be interpreted as characters using decode().
>
> - Manipulate the data using perl string operations
>
> - Output the data to a filehandle which is opened using the correct
>   encoding.
>
> The from_to function looks enticing, particularly because everyone has
> heard about perl and utf8 strings, but it's almost always the wrong
> thing to use. And perl does not use utf8, but supports unicode character
> semantics.

And the problem is to grasp the difference! Thanks a lot all of you!
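
To make the difference concrete for myself, here is a rough sketch of the decode, manipulate, encode sequence Bart describes, using decode() and encode() from the Encode module. The byte string is a made-up stand-in for data read raw from a file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "ÖRJAN", as they might arrive from a socket.
my $bytes    = "\xC3\x96RJAN";
my $byte_len = length $bytes;    # length counts bytes here

# Decoding interprets the bytes as characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;    # length counts characters here

# String operations belong on the decoded characters ...
my $lower = lc $chars;

# ... and encoding back to bytes happens only on the way out.
my $out = encode('UTF-8', $lower);

printf "%d bytes, %d characters\n", $byte_len, $char_len;
binmode STDOUT;                  # $out is already bytes, so write it raw
print $out, "\n";
```

The same name is six bytes but five characters, and lc only does the right thing on the five-character version.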

Sigge



> --
> Bart.
>
