Re: Workaround to a unicode bug needed

Pierre Nugues Mon, 06 Sep 2010 13:27:24 -0700

Dear Michael,

> Pierre Nugues schrieb am 06.09.2010 um 11:09 (+0200):
>> I wrote a simple tokenizer for texts containing Latin9 characters.
> 
> You probably mean "non-ASCII characters". Latin9 alias ISO-8859-15 is
> the encoding. It's worth while making a distinction here.
I meant what I wrote. The text contains a subset of the Latin9 characters. The 
list I consider in tr// is:
a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ


It is not a subset of Latin1, because of œ and Œ.
The encoding of the text is UTF-8.
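
For reference, the Latin1/Latin9 distinction for œ and Œ can be checked with the core Encode module. This is only a minimal sketch: the helper name encodable is made up for the illustration, and it relies on encode() croaking on an unmappable character when Encode::FB_CROAK is passed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original program):
# true if $char can be represented in $encoding.
sub encodable {
    my ($encoding, $char) = @_;
    my $copy = $char;    # encode() may modify its argument when CHECK is set
    return eval { encode($encoding, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $cp (0x153, 0x152) {    # U+0153 is œ, U+0152 is Œ
    my $char = chr $cp;
    printf "U+%04X  Latin1: %s  Latin9: %s\n", $cp,
        encodable('iso-8859-1',  $char) ? 'yes' : 'no',
        encodable('iso-8859-15', $char) ? 'yes' : 'no';
}
```

Both code points should be reported as encodable in Latin9 (ISO-8859-15) but not in Latin1 (ISO-8859-1).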

>> When I run the program on a Mac Snow Leopard, with version 5.8.8 on
>> the text encoded in UTF-8, Perl outputs a defective UTF-8 code for
>> this character: <BB>
> 
> As someone said, given correct input acquisition, the output depends on
> your binmode setting and your terminal. Use ":utf8" or ":encoding(UTF-8)"
> for a UTF-8 terminal.
I tried different combinations of binmode input and output and none of them 
works. Here is my terminal configuration:
Pierres-MBP-3:ch04 pierre$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

The setting LC_ALL='C' does not work either.

>> ### An elementary tokenizer. Save it in UTF-8
> 
> Then it should have "use utf8" indeed.
I tried this too.

> If your data is text in UTF-8, you'll want to set your filehandle
> (STDIN in this case) to UTF-8:
> 
>  binmode STDIN, ':encoding(UTF-8)'; # ':utf8' also works but is less strict
This does not work either.
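
To make the combination under discussion concrete, here is a minimal sketch of what "use utf8" plus explicit decoding and encoding around tr// is expected to look like. It is not the original tokenizer: the helper name tokenize_bytes and the sample sentence are invented for the illustration, it works on an in-memory string instead of STDIN, and it assumes a Perl in which tr// on decoded strings behaves as documented, which may well not hold for the 5.8.8 build described above.

```perl
use strict;
use warnings;
use utf8;                        # the tr// list below is UTF-8 source text
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 bytes and returns
# one token per line, again as UTF-8 bytes.
sub tokenize_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    # /c complements the list, /s squeezes each run into a single newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùüûÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ/\n/cs;
    return encode('UTF-8', $text);         # characters -> bytes
}

# Sample input, built as bytes the way they would arrive from a UTF-8 file.
my $input = encode('UTF-8', "Un œuf, s'il vous plaît");
print tokenize_bytes($input), "\n";
```

Decoding and encoding by hand here plays the same role as setting ':encoding(UTF-8)' on both STDIN and STDOUT with binmode; the explicit form just makes it easier to see where bytes become characters and back.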

> I don't know your goal, but consider that \w in a regular expression
> works fine to catch words:
I know there are workarounds, but I would like to find one for tr//.
Pierre
