Re: Decoding more languages

Nick Ing-Simmons Tue, 13 Apr 2004 03:48:56 -0700

Octavian Rasnita <[EMAIL PROTECTED]> writes:
>    Oh, sorry, but I've made a mistake when writing the message.
>The Romanian language uses ISO-8859-2 and not ISO-8859-1
>So the question remains. Is it possible to decode a text written in more
>languages that use more charsets?

Yes. But perhaps not as easily as you would like.
You need markers which show where the encodings change.
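
As a rough sketch of what such markers could look like: the "[[name]]" 
syntax below is purely hypothetical (it is not anything Encode defines), 
and the sample words are made up for illustration. Each marker names the 
encoding of the bytes that follow it, and each chunk is decoded separately.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical convention: "[[iso-8859-2]]" announces the encoding of the
# bytes that follow, up to the next marker. The bytes are written as hex
# escapes so this script itself stays plain ASCII.
my $bytes = "[[iso-8859-1]]ni\xF1a [[iso-8859-2]]fat\xE3";

my $text = '';
# A capturing group in split keeps the encoding names in the result list.
# The first field (before the first marker) is empty and is discarded.
my (undef, @parts) = split /\[\[([\w-]+)\]\]/, $bytes;
while (my ($enc, $chunk) = splice(@parts, 0, 2)) {
    $text .= decode($enc, $chunk);   # bytes -> Perl characters
}

# $text now holds characters, including U+00F1 and U+0103.
printf "%d characters\n", length $text;
```

Any unambiguous marker would do; the point is only that the program is 
told where each encoding starts rather than having to guess.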

For perl purposes the language is not important, it is the 
"charset" (encoding) that matters. The encoding determines what 
the 8-bit bytes (also called octets) in a file mean as characters.
So one "file" can normally only be in one encoding - this includes
the perl script. Unicode and UTF-8 are designed to avoid this problem
because UTF-8 can represent any Unicode code point and there 
are Unicode code points for (almost) all characters used by any 
language.  
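
To make that concrete, here is a small sketch that decodes the same 
single octet under two encodings. The octet value 0xE3 is just an 
example I picked; many other values in the upper half behave the same way.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octet = "\xE3";

# The same byte means a different character in each encoding.
my $latin1 = decode('iso-8859-1', $octet);   # a with tilde
my $latin2 = decode('iso-8859-2', $octet);   # a with breve

printf "U+%04X U+%04X\n", ord($latin1), ord($latin2);
# prints "U+00E3 U+0103"
```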

However, older 8-bit encodings like iso-8859-1 and iso-8859-2 each pick a 
different subset of at most 256 characters, so the same byte value can 
stand for a different character in each encoding.

So you cannot just enter 8-bit string literals in both encodings 
into one perl script, and have perl know what they are directly.
But you can have:

use Encode;

my $spanish = "...";    # bytes in iso-8859-1
my $romanian = "...";   # bytes in iso-8859-2
# Note that only one of those can "look right" in an iso-8859-* editor

my $combined = Encode::decode('iso-8859-1',$spanish).
               Encode::decode('iso-8859-2',$romanian);

You can then "print" the combined string as UTF-8 (or other Unicode 
encoding). But you will then need some way of viewing the Unicode 
file. An editor which can view the UTF-8 file will probably also 
allow you to enter UTF-8 strings directly as well. So you could 
write your script in UTF-8 (with "use utf8;" at the top) and avoid the problem.
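
A minimal sketch of the "print as UTF-8" step, assuming made-up sample 
words written as hex escapes so the script does not depend on the 
editor's encoding:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $spanish  = "ni\xF1a";    # bytes in iso-8859-1 (sample word)
my $romanian = "fat\xE3";    # bytes in iso-8859-2 (sample word)

my $combined = decode('iso-8859-1', $spanish) . ' ' .
               decode('iso-8859-2', $romanian);

# Turn the character string back into bytes, this time as UTF-8.
my $utf8 = encode('UTF-8', $combined);

binmode STDOUT;              # we are printing raw bytes
print $utf8, "\n";
```

An alternative is to put an encoding layer on the handle, e.g. 
binmode(STDOUT, ':encoding(UTF-8)'), and print $combined directly.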

Note that you cannot (in general) "print" the combined string as
either 8859-1 or 8859-2, because each of them lacks some of the 
characters that came from the other.
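
A sketch of what goes wrong, again with made-up sample words: the 
combined string contains U+00F1 (not in 8859-2) and U+0103 (not in 
8859-1), so neither 8-bit encoding can hold all of it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $combined = decode('iso-8859-1', "ni\xF1a") . ' ' .
               decode('iso-8859-2', "fat\xE3");

# By default encode() silently replaces characters it cannot map
# with the encoding's substitution character.
my $lossy = encode('iso-8859-1', $combined);

# With FB_CROAK it dies instead; LEAVE_SRC stops it modifying $combined.
my $ok = eval {
    encode('iso-8859-2', $combined,
           Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "cannot encode as iso-8859-2: $@" unless $ok;
```

Asking for the strict behaviour is usually safer, since the default 
loses data without telling you.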

>
>Thank you.
>
>
>----- Original Message ----- 
>From: "Nick Ing-Simmons" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Tuesday, April 13, 2004 11:13 AM
>Subject: Re: Decoding more languages
>
>
>> Octavian Rasnita <[EMAIL PROTECTED]> writes:
>> >Hello all,
>> >
>> >I want to transform a text that contains words in more languages (it is a
>> >course for learning a foreign language) in UTF-8.
>> >I have 2 texts, one that contains Romanian and French words, and another one
>> >that contains Romanian and Spanish words.
>> >I have seen that I can Encode::decode('ISO-8859-1', $text) the romanian
