Re: Some UTF-8-related questions

Brian Fraser Wed, 11 Jan 2012 05:47:40 -0800

On Wed, Jan 11, 2012 at 7:59 AM, Hamann, T.D. (Thomas) <
ham...@nhn.leidenuniv.nl> wrote:


> Hi,
>
> Thanks for the answers on my last question. I have since then dug a bit
> further in the UTF-8-related error message I got, and after some reading
> have a few questions with regards to UTF-8 handling in perl:
>
> (Please bear in mind that I am not an IT guy)
>

Worry not -- Basically no IT person gets this right anyway : )


>
> 1a) My use statements are the following:
>
> use warnings;
> use strict;
> use utf8;
> use open ':encoding(utf8)';
>

I would add

use feature qw(unicode_strings);

or even

use if $^V ge v5.12, feature => qw(unicode_strings);

and replace :encoding(utf8) with :encoding(UTF-8), but see below.


>
> Now if I understand it correctly, there's two ways of encoding UTF-8 in
> perl: One liberal (utf8) and one strict (UTF-8). For my purpose, I need
> correctly encoded UTF-8 files. However, I cannot be sure whether the files
> I start with are properly encoded in UTF-8.
>

That's mostly right, but I think you are mistaken about when to use the
lax version, utf8. It is only useful when reading something produced by
another Perl process that used the lax encoding and output illegal UTF-8.
For example:

use strict;
use warnings;
use feature 'say';
use Devel::Peek;

open my $out_fh, ">:utf8", "invalid_UTF-8.txt" or die $!;
say { $out_fh } "This here: [\x{FFFF_FFFF}] is illegal UTF-8, "
              . "but valid in Perl's lax internal encoding";
close $out_fh or die $!;

for my $encoding ( qw< utf8 encoding(UTF-8) > ) {
    say "Encoding: [$encoding]";
    open my $in_fh, "<:$encoding", "invalid_UTF-8.txt" or die $!;
    my $line = <$in_fh>;
    Dump $line;
    close $in_fh;
}

What you get depends on whether $encoding is utf8 or encoding(UTF-8),
though the difference is a bit hard to spot. For the former, you'll get
back the string that you originally printed, but for the latter, Encode
will complain about \x{FFFF_FFFF} not being in Unicode, and give you a
string with a literal \x{FFFFFFFF}, as if you had written it in single
quotes!

The bottom line is that you scarcely ever want the lax, internal form.
More so because it's subject to change in upcoming Perl versions, since
what it currently does is whack.
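
Here is a minimal sketch of the same lax/strict distinction using Encode
directly, without any files involved. It assumes Encode 2.10 or later; the
helper name can_encode is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding("UTF-8")->name;   # the strict variant
my $lax_name    = find_encoding("utf8")->name;    # the lax variant

# A code point far above the Unicode maximum of U+10FFFF.
my $bogus = chr(0xFFFF_FFFF);

# Hypothetical helper: true if encode() accepts $string under $encoding.
sub can_encode {
    my ($encoding, $string) = @_;
    # encode() may modify its argument when a CHECK value is given,
    # so it only ever sees the local copy in $string.
    return eval { encode($encoding, $string, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $lax_ok    = can_encode("utf8",  $bogus);
my $strict_ok = can_encode("UTF-8", $bogus);

print "strict name: $strict_name, lax name: $lax_name\n";
print "lax accepts it: $lax_ok, strict accepts it: $strict_ok\n";
```

The lax encoding should let the bogus code point through, while the strict
one should croak, which is why the strict spelling is the safer default for
anything that leaves your program.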


> So is it possible to open a file using the liberal interpretation, and
> write to a new file using the strict interpretation? Are there any issues
> regarding this, like characters that might not be re-encoded properly?
>

See the above example. Should be entirely fine as long as the contents of
the file are all legal UTF-8.


>
> 1b) How can I check whether a file is properly encoded UTF-8?
>

Generally, you don't want to do this manually. If you are using the strict
encoding, it'll throw warnings all over the place, and if you've made
warnings fatal, your program will stop.

However, you can check by not using any implicit encoding, then manually
encoding/decoding as needed:

use Encode qw( encode decode );

open my $in_fh, "<:raw", $file or die $!;

while ( my $raw_line = <$in_fh> ) {
    # LEAVE_SRC keeps decode() from modifying $raw_line in place.
    my $line = eval {
        decode "UTF-8", $raw_line, Encode::FB_CROAK | Encode::LEAVE_SRC;
    };
    if ( $@ ) {
        warn "Line $. isn't valid UTF-8";
        # Try another encoding here:
        # $line = eval { decode ... };
    }
    # If we got here with an error, assume the default encoding (latin-1)
    $line = $raw_line if $@;
    ... # Stuff
}

close $in_fh or die $!;
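
If all you need is a yes/no answer for a chunk of bytes, the same idea can
be wrapped in a small function. This is only a sketch; the name
is_valid_utf8 is made up, and it checks well-formedness only, not whether
the text makes sense.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if $bytes is well-formed strict UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on the first malformed sequence;
        # LEAVE_SRC stops it from modifying $bytes while it works.
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $good = is_valid_utf8("caf\xC3\xA9 au lait");   # "café" encoded as UTF-8
my $bad  = is_valid_utf8("caf\xE9 au lait");       # "café" encoded as latin-1
print "good: $good, bad: $bad\n";
```

To check a whole file, read it in :raw mode and feed each line (or the
whole contents) to the function.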


>
> 2a) As I understand it, Windows has a somewhat limited ability to display
> certain UTF-8 characters, although some fonts can display more of them. The
> characters do exist in the file, even if Windows can't display them
> (besides showing a square). Is this correct? If not, does that impact
> perl's ability to handle Unicode?
>

That's correct. Being able to display a character has no impact on Perl's
ability to process that character.


>
> 2b) Do scripts themselves have to be encoded in UTF-8 to be able to
> process UTF-8-files?


Nope.


> If not, when should you encode the scripts in UTF-8 and when not?


When you are using UTF-8 literals in your code, for example

say "In katakana, [ni] is [ニ]";

or

my $león = "Simba";

In which case the file needs to have a "use utf8;" on top, as well as being
properly encoded in UTF-8.
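
To see what "use utf8" actually changes, here is a small sketch. It assumes
the script itself is saved as UTF-8; the variable names are just for
illustration.

```perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is encoded in UTF-8

# With the pragma in effect, the literal is a single character.
my $with_pragma = length "ニ";

my $without_pragma = do {
    no utf8;        # inside this block the literal is seen as raw bytes
    length "ニ";    # the three bytes E3 83 8B
};

binmode STDOUT, ":encoding(UTF-8)";
print "with use utf8: $with_pragma, without: $without_pragma\n";
```

Without the pragma, perl treats the literal as three separate bytes, which
is the usual source of doubly-encoded accented characters in output.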


> Most of my scripts add text to UTF-8 encoded text files. I've noticed that
> this sometimes seems to change the encoding or give error messages when
> e.g. accented characters are involved. Am I right in assuming that only
> scripts that remove text or extract certain parts do not need to be encoded
> in UTF-8?
>

The encoding of the source has basically no relevance whatsoever [*],
unless you are using "use encoding", which you shouldn't. Errors with
accented characters are probably due to using latin-1 and mistakenly
assuming that you are using UTF-8, or the reverse. The likely culprits for
this sort of thing are that you forgot to "use utf8", or your editor isn't
outputting UTF-8 (maybe latin-1?), or you are using the wrong encoding for
reading/writing.

[*] Nitpick: Unless you are reading things from a __DATA__ section, which
inherits the UTF8-ness of the file in which it was found.


> 2c) Not really a perl question: Does anyone know of a monospaced font for
> Windows that handles most UTF-8 characters gracefully? I would like one for
> use in Notepad++ to make it easier to write scripts containing special
> characters not normally displayable in Windows.
>

Symbola. It's awesome. \N{DROMEDARY CAMEL}


>
>
> 3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM.


Nope. Windows uses UTF-16, which requires a BOM to distinguish between
UTF-16LE and UTF-16BE. Most Unices use UTF-8, which doesn't require a BOM
and, in fact, using one is against Unicode's recommendation. If you spot a
file with a UTF-8 BOM, quickly s/// it away!
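
A minimal sketch of that s///, assuming the data has already been decoded
from UTF-8, so the BOM shows up as the single character U+FEFF. The name
strip_bom is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: removes a leading U+FEFF from already-decoded text.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

my $bytes   = "\xEF\xBB\xBFhello";        # UTF-8 BOM followed by "hello"
my $decoded = decode("UTF-8", $bytes);    # the BOM survives as U+FEFF
my $clean   = strip_bom($decoded);

printf "before: %d characters, after: %d characters\n",
    length($decoded), length($clean);
```

The \A anchor matters: U+FEFF anywhere else in the text is not a BOM and
is left alone.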


> A particular script of mine prepends a piece of text to UTF-8 encoded text
> files created with MS Word on Windows (saved as .txt with UTF-8 encoding).
> Unfortunately, this appears to break the encoding, which changes from
> "UTF-8 with BOM" to "UTF-8 without BOM", probably because the text is
> inserted *before* the BOM at the start of the file. How do I prevent this?
> How can my script recognize the BOM at the start of the file?
>

Been a while since I used Word, but I've got a hunch that "UTF-8 with BOM"
is actually marked as "Unicode", which in Windowspeak is UTF-16, and see
the note about the BOM above.

As mentioned above, you generally -don't- want to read the file and start
guessing encodings. That road leads to madness. It would be helpful if you
posted some snippets of code that show what the problem is and where it
lies, so that we could give you more accurate advice. However, if you
absolutely must go on guessing, check out File::BOM and/or Encode::Guess,
or try manually decoding as shown above.
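
If you only want to recognize a BOM at the start of a file, you can also
look at the first few raw bytes yourself. This is a rough sketch using no
modules; sniff_bom is a made-up name, and it deliberately ignores UTF-32,
whose little-endian BOM starts with the same two bytes as UTF-16LE.

```perl
use strict;
use warnings;

# Hypothetical helper: maps a leading BOM in raw bytes to an encoding
# name, or returns undef when no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    return "UTF-8"    if $bytes =~ /\A\xEF\xBB\xBF/;
    return "UTF-16LE" if $bytes =~ /\A\xFF\xFE/;
    return "UTF-16BE" if $bytes =~ /\A\xFE\xFF/;
    return;
}

my $utf8  = sniff_bom("\xEF\xBB\xBFtext");
my $utf16 = sniff_bom("\xFF\xFEt\x00e\x00");
my $none  = sniff_bom("plain text");

print "first:  $utf8\n";
print "second: $utf16\n";
print "third:  ", ( defined $none ? $none : "no BOM" ), "\n";
```

A script that prepends text could read the file in :raw mode, call
something like this on the first bytes, and then write the new text after
the BOM rather than before it.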
