Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):

> I wrote a simple tokenizer for texts containing Latin9 characters.

You probably mean "non-ASCII characters". Latin9, alias ISO-8859-15, is
the encoding. It's worthwhile making a distinction here.

> When I run the program on a Mac Snow Leopard, with version 5.8.8 on
> the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
> this character: <BB>

As someone said, provided the input is read correctly, the output depends
on your binmode setting and your terminal. Use ":utf8" or
":encoding(UTF-8)" for a UTF-8 terminal.

> ### An elementary tokenizer. Save it in UTF-8

Then it should have "use utf8" indeed.

> __BEGIN
>
> while ($line = <>) {
>     $text .= $line;
> }

If your data is text in UTF-8, you'll want to set your filehandle (STDIN
in this case) to UTF-8:

    binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but does not validate the input

> $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
> # The dash character must be quoted
> $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
> $text =~ s/\n+/\n/g;
> print $text;

I don't know your goal, but consider that \w in a regular expression
works fine to catch words:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use warnings;

my $fn = shift or die "File name required!"; # file in some encoding
open my $fh, '<:encoding(iso-8859-15)', $fn or die "open $fn: $!";
my $txt = do { local $/; <$fh> };   # slurp the whole file
close $fh;

binmode STDOUT, ':utf8'; # UTF-8 terminal
print $txt;

my @words = $txt =~ m/\w+/gms;
print $_, "\n" for @words;

-- 
Michael Ludwig
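
P.S. A minimal sketch of the \w approach applied to tokenizing, under a few assumptions: the sub name tokenize and the sample sentence are made up for illustration, the file is saved as UTF-8 (hence "use utf8"), and the terminal expects UTF-8. It is one possible way to split words from punctuation, not a drop-in replacement for the original tr/// logic.

```perl
use strict;
use warnings;
use utf8;                        # this source file itself is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical helper: split decoded text into words and punctuation marks.
sub tokenize {
    my ($text) = @_;
    # \w+ matches a run of word characters (accented letters included,
    # once the string has been decoded); [^\w\s] picks up each remaining
    # non-space character, such as punctuation, as a token of its own.
    return $text =~ m/\w+|[^\w\s]/g;
}

# Simulate input arriving as raw UTF-8 bytes, then decode it to characters.
my $bytes = encode('UTF-8', "«Où est l'œuf?»");
my $text  = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal
print "$_\n" for tokenize($text);
```

The decode step is the important part: on a string of undecoded bytes, \w may miss accented letters, which is why the input layer (or an explicit decode) has to be right before any regular expression is applied.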