Dear Michael,

> Pierre Nugues wrote on 06.09.2010 at 11:09 (+0200):
>> I wrote a simple tokenizer for texts containing Latin9 characters.
>
> You probably mean "non-ASCII characters". Latin9 alias ISO-8859-15 is
> the encoding. It's worth while making a distinction here.

I meant what I wrote. The text contains a subset of the Latin9 characters. The list I consider in tr// is:

a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ

It is not a subset of Latin1, because of œ and Œ. The encoding of the text is UTF-8.

>> When I run the program on a Mac Snow Leopard, with version 5.8.8 on
>> the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
>> this character: <BB>
>
> As someone said, given correct input acquisition, the output depends on
> your binmode setting and your terminal. Use ":utf8" or "encoding(UTF-8)"
> for a UTF-8 terminal.

I tried different combinations of binmode on input and output, and none of them works. Here is my terminal configuration:

Pierres-MBP-3:ch04 pierre$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

Setting LC_ALL='C' does not work either.

>> ### An elementary tokenizer. Save it in UTF-8
>
> Then it should have "use utf8" indeed.

I tried this too.

> If your data is text in UTF-8, you'll want to set your filehandle
> (STDIN in this case) to UTF-8:
>
>   binmode STDIN, 'encoding(UTF-8)'; # 'utf8' also works but less correct

This does not work either.

> I don't know your goal, but consider that \w in a regular expression
> works fine to catch words:

I know there are workarounds, but I would like to find one for tr//.

Pierre
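
For reference, here is a minimal sketch of the kind of tr// tokenizer discussed in this thread, combining the three suggestions quoted above: "use utf8" for the source file, and an encoding layer on both STDIN and STDOUT. It is not the original program. The helper name tokenize and the use of the /c and /s modifiers are assumptions made for illustration. Note that the layer name passed to binmode is written with a leading colon, ':encoding(UTF-8)'. Whether this behaves correctly on Perl 5.8.8 is precisely the open question here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file is saved as UTF-8

# Hypothetical helper: every character outside the list becomes a newline
# (/c complements the list, /s squeezes runs of newlines into one).
sub tokenize {
    my ($text) = @_;
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ/\n/cs;
    return $text;
}

# Act as a filter only when run directly, not when loaded with do/require.
unless (caller) {
    binmode STDIN,  ':encoding(UTF-8)';    # decode input bytes to characters
    binmode STDOUT, ':encoding(UTF-8)';    # encode characters back to UTF-8
    local $/;                              # slurp the whole input at once
    my $text = <STDIN>;
    $text = '' unless defined $text;
    print tokenize($text), "\n";
}
```

Saved under an assumed name such as tokenize.pl, it would be run as "perl tokenize.pl < text.txt" in a UTF-8 terminal.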