Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):
> I wrote a simple tokenizer for texts containing Latin9 characters.

You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
an encoding, not a set of characters. It's worthwhile making that
distinction here.

> When I run the program on a Mac Snow Leopard, with version 5.8.8 on
> the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
> this character: <BB>

As someone said, provided the input is read correctly, the output depends
on your binmode setting and your terminal. Use ":utf8" or
":encoding(UTF-8)" for a UTF-8 terminal.
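
To illustrate why a lone <BB> byte is not valid UTF-8, here is a minimal
sketch. It assumes the character in question is U+00BB (the right-pointing
guillemet), which is only my guess from the <BB> in your output; the
variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{BB}";   # assumption: U+00BB, right-pointing guillemet

# The same character, serialized in two different encodings.
my $latin9 = encode('iso-8859-15', $char);   # a single byte
my $utf8   = encode('UTF-8',       $char);   # a two-byte sequence

printf "%-12s %s\n", 'iso-8859-15', unpack('H*', $latin9);
printf "%-12s %s\n", 'UTF-8',       unpack('H*', $utf8);
```

A UTF-8 terminal expects the two-byte form. If STDOUT has no encoding
layer, Perl writes characters below 256 as single bytes, which such a
terminal cannot interpret.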

> ### An elementary tokenizer. Save it in UTF-8

Then it does need "use utf8", so that Perl reads the non-ASCII literals
in the source as characters rather than as bytes.

> __BEGIN
> 
> while ($line = <>) { 
>    $text .= $line;
> }

If your data is text in UTF-8, you'll want to set your filehandle
(STDIN in this case) to UTF-8:

  binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but is less strict
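
A small sketch of what the input layer changes. To keep it self-contained
it reads from an in-memory filehandle instead of STDIN, and the sample
bytes and variable names are my own:

```perl
use strict;
use warnings;

# The UTF-8 bytes for the word "oeuvre" written with the oe ligature
# (U+0153 is C5 93 in UTF-8), followed by a newline.
my $bytes = "\xC5\x93uvre\n";

# Read without a layer: Perl hands back the raw bytes.
open my $raw, '<', \$bytes or die "open: $!";
my $undecoded = <$raw>;
close $raw;

# Read through the UTF-8 layer: Perl hands back decoded characters.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $decoded = <$in>;
close $in;

chomp($undecoded, $decoded);
printf "without layer: %d\n", length $undecoded;   # counts bytes
printf "with layer:    %d\n", length $decoded;     # counts characters
```

Only the decoded string has the ligature as one character, so only there
can tr/// or a regular expression treat it as a letter.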

> $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
>   # The dash character must be quoted
> $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
> $text =~ s/\n+/\n/g;
> print $text;

I don't know your goal, but consider that \w in a regular expression
works fine to catch words:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use warnings;
my $fn = shift or die "Usage: $0 file\n"; # file in some encoding
# Decode while reading; adjust the layer to the file's real encoding.
open my $fh, '<:encoding(iso-8859-15)', $fn or die "open $fn: $!";
my $txt = do { local $/; <$fh> }; # slurp the whole file
close $fh;
binmode STDOUT, ':utf8'; # UTF-8 terminal
print $txt;
my @words = $txt =~ m/\w+/g; # every run of word characters
print $_, "\n" for @words;
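
If the goal is also to keep punctuation as separate tokens, as your
substitution seems to do, something along these lines might work. This is
only a sketch under that assumption; the sample text and the pattern are
mine, not taken from your program.

```perl
use strict;
use warnings;

# Sample text written with \x escapes so the example does not depend on
# how this file itself is encoded (oe ligature, e acute, a grave).
my $text = "L'\x{153}uvre, d\x{e9}j\x{e0} vu!";

# A token is either a run of word characters or a single character
# that is neither a word character nor whitespace (i.e. punctuation).
my @tokens = $text =~ /\w+|[^\w\s]/g;

binmode STDOUT, ':encoding(UTF-8)'; # UTF-8 terminal
print "$_\n" for @tokens;
```

This prints one token per line, with the apostrophe, the comma and the
exclamation mark each on a line of their own.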

-- 
Michael Ludwig
