On Wed, 3 Sep 2003, Bart Schuller wrote:

> On Wed, Sep 03, 2003 at 01:05:21PM +0200, [EMAIL PROTECTED] wrote:
> > use Encode 'from_to';
> >
> > my $orjan = 'ÖRJAN';
> > my $lundstrom = 'LUNDSTRÖM';
> >
> > print $orjan . ' ' . $lundstrom . "\n";
> >
> > from_to $orjan,'latin1','utf-8';
> > from_to $lundstrom,'latin1','utf-8';
>
> It is my understanding that from_to is the wrong thing to use here. The
> variables $orjan and $lundstrom contain perl strings containing perl
> characters with unicode semantics.

I think I'm starting to understand... Terry Jones pointed out the problem
with from_to. I removed those calls, added the "use utf8;" pragma and
'rewrote' my program in utf-8 using the recode program. Then it started
to work as expected.

> from_to is used to encode bytes in one encoding into bytes in another
> encoding. Both before and after this operation do these bytes *not*
> equal characters for perl. So you should not use perl level operations
> like uc or lc or regexes or substr on them.
>
> The way it all is supposed to work is:
>
> - you obtain some character data, for example by putting it literally in
>   your script. If the script itself is in utf-8, it should contain
>   "use utf8;". If not (like your script), perl will assume ISO-8859-1.

Exactly, this is my working case above.

>   A different source of data would be reading from a file, which is
>   opened with the correct encoding specified (see Andreas' reply).

I tested that. I created a file with fake personal names, first in
iso-8859-1, and just read it. That didn't work very well. Then I converted
the file to utf-8, and after adding

    binmode STDOUT, ":utf8";
    binmode STDIN, ":utf8";

it worked.

>   A third source would be by reading a file or a socket and obtaining raw
>   bytes which can be interpreted as characters using decode().
>
> - Manipulate the data using perl string operations
>
> - Output the data to a filehandle which is opened using the correct
>   encoding.
>
> The from_to function looks enticing, particularly because everyone has
> heard about perl and utf8 strings, when it's almost always the wrong
> thing to use. And perl does not use utf8, but supports unicode character
> semantics.

And the problem is to grasp the difference!

Thanks a lot, all of you!

Sigge

> --
> Bart.
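
For reference, the decode / manipulate / encode workflow Bart describes
above could be sketched roughly as follows. This is only a minimal sketch,
not code from the thread: the variable names are made up for illustration,
and the input is written with \x escapes (an assumption made here) so that
the example does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw ISO-8859-1 bytes for the two names from the original script.
# 0xD6 is the Latin-1 byte for capital O with diaeresis.
my $bytes = "\xD6RJAN LUNDSTR\xD6M";

# Step 1: turn the bytes into Perl characters.
my $name = decode('ISO-8859-1', $bytes);

# Step 2: string operations such as lc now act on characters,
# so the capital O with diaeresis is lowercased as well.
my $lower = lc $name;

# Step 3: turn the characters back into bytes for output, here as UTF-8.
my $utf8 = encode('UTF-8', $lower);

# The data is already encoded, so write it out without a further layer.
binmode STDOUT, ':raw';
print $utf8, "\n";
```

When the data comes from a file instead, opening the handle with the right
encoding layer (as in the binmode calls above) takes the place of the
explicit decode and encode steps.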