On Sat, Jul 28, 2007 at 18:31:19 +0200, Bernhard Kuemel wrote:
> Hi debian-user!
> 
> I converted to utf8 in the hope that my non ASCII character problems
> would disappear. They are now ... different.

Having your system on utf8 means that you should be able to display and
use all sorts of special characters, limited only by the fonts you have
installed. However, you still have to convert texts which are in
different encodings, especially since some applications do not declare
the used encoding properly. Often it is not even possible to declare the
encoding in a standardized way, for example in plain text files.
 
> I used utf8migrationtool and locale now says:
> 
> [EMAIL PROTECTED]:~$ locale
> LANG=en_US.UTF-8

[ snip: The rest is en_US.UTF-8, too. ]

> I am in Austria, where we speak German, but I chose en because the
> German translations are often so ridiculous (in mc's config:
> 'verbose operation' gets 'redselige Vorgaenge', bash says 'getoetet'
> instead of 'killed', when a process gets killed).
> 
> I chose US because I thought that was most used and thus most stable.

I do the same because I also do not like the translations; they just
seem stilted and unnatural to me. You can still use other locales for
some things, for example to have day-month-year dates, sensible paper
sizes, and POSIX numbers/sorting:

$ locale | grep -v en_US
LC_NUMERIC=POSIX
LC_TIME=en_GB.UTF-8
LC_COLLATE=POSIX
LC_PAPER=de_DE.UTF-8
LC_ALL=

You can mix and match as you want, provided that you generated the
necessary locales. (dpkg-reconfigure locales)
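For example, a minimal sketch of such a mixed setup in ~/.profile (the
values are just the ones from my listing above; adjust to taste):

```shell
# Sketch of a mixed locale setup, e.g. in ~/.profile. All of these
# locales must have been generated via "dpkg-reconfigure locales".
export LANG=en_US.UTF-8        # default for everything not set below
export LC_TIME=en_GB.UTF-8     # day-month-year dates
export LC_PAPER=de_DE.UTF-8    # A4 instead of Letter
export LC_NUMERIC=POSIX        # plain decimal point, no grouping
export LC_COLLATE=POSIX        # plain byte-order sorting
# LC_ALL must remain unset; it would override all of the above.
locale                         # show the effective settings
```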

> Now the problems:
> 
> I wanted to print a German text containing umlauts from a web page.
> I marked it in iceweasel and pasted it into a 'konsole' running bash
> running 'cat >x'. 'lpr x' printed only a page with the character 'K'.
> 
> 'hexdump -C x' says:
> 
> 00000010  20 20 20 20 20 20 4b fc  6e 64 69 67 75 6e 67 73  |      K.ndigungs|
> 00000020  62 65 73 63 68 72 e4 6e  6b 75 6e 67 65 6e 0a 0a  |beschr.nkungen..|
> 
> so ü is 0xfc, ä is 0xf4, and the characters are printed as
                               0xe4 (typo)
> periods '.'.

Those are the iso8859-1 codes for "ü" and "ä":

$ echo -n "üä" | iconv -f utf8 -t iso8859-1 | hd
00000000  fc e4                                             |..|
00000002

Here I used "iconv" to convert the utf8 output of echo to the iso8859-1
encoding. You can use it the other way round to convert the text from
the website to utf8:

iconv -f iso8859-1 -t utf8 | cat >x

It should be possible to omit the "-t utf8" part, since iconv falls
back to the current locale's encoding when "to" or "from" is not
explicitly specified.
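Applied to the saved file from above, the whole round trip looks like
this (a sketch; the printf line just recreates the ISO-8859-1 bytes
from the hexdump):

```shell
# Recreate the problematic ISO-8859-1 file from the hexdump...
printf 'K\xfcndigungsbeschr\xe4nkungen\n' > x
# ...and convert it to UTF-8. iconv cannot write to its input file,
# so go through a temporary file.
iconv -f iso8859-1 -t utf8 x > x.utf8 && mv x.utf8 x
cat x    # a UTF-8 terminal now shows the umlauts correctly
```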

Other common encodings to try in such cases are iso8859-15 (if the text
contains Euro signs) and cp1252 (the Microsoft Western European
codepage). "iconv -l" will give you a full listing of all supported
encodings.
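When guessing, file(1) can narrow things down before you start
converting; a sketch (file only makes an educated guess, and the
sample file here is fabricated):

```shell
# A fabricated ISO-8859-1 sample ("größer" with latin1 umlaut bytes):
printf 'gr\xf6\xdfer\n' > sample.txt
file sample.txt     # typically reports "ISO-8859 text"
# Try the likely candidates and eyeball which output reads correctly:
for enc in iso8859-1 iso8859-15 cp1252; do
    printf '%s: ' "$enc"
    iconv -f "$enc" -t utf8 sample.txt
done
```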

> mc's viewer says:
> 
> 00000010 20 20 20 20  20 20 4B FC  6E 64 69 67  75 6E 67 73 KÃ¼ndigungs
> 00000020 62 65 73 63  68 72 E4 6E  6B 75 6E 67  65 6E 0A 0A beschrÃ¤nkungen..
> 
> Here ü is still only the single byte 0xFC, but it gets printed
> as 'A' with a tilde and a '1/4' character. ä is again 0xE4 but
> printed as 'A' with a tilde and a circle with 4 short lines
> extending from the circle diagonally.

That is very strange: Two-character sequences like "Ã¼" are normally the
symptom of interpreting a utf8-encoded special character as if the text
was in an iso* encoding instead. (utf8 uses multi-byte sequences for
special characters.)

$ echo "üä" | iconv -f iso8859-1 -t utf8
Ã¼Ã¤

(To demonstrate this, I deliberately told iconv to misinterpret the utf8
 output of echo as being in iso8859-1.)

It looks like mc recognizes that there are iso8859-1 encoded characters,
converts them internally to utf8 and then fails in the last step by
interpreting the converted characters as iso8859-1 again. This is either
a bug in mc or a configuration problem.
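You can reproduce that double conversion by hand (here I deliberately
feed the two UTF-8 bytes of "ü" to iconv as if they were iso8859-1):

```shell
# 0xc3 0xbc is "ü" in UTF-8; decoded as ISO-8859-1 it becomes "Ã"
# plus "¼", exactly the two-character garbage described above.
printf '\xc3\xbc\n' | iconv -f iso8859-1 -t utf8
```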

> Opening x in openoffice writer shows rhombuses with question marks
> for each umlaut.

That is the proper reaction: Your text has bytes in it which make no
sense in utf8 and therefore a "this makes no sense" placeholder is
displayed. Your terminal should react in the same way:

$ echo "üä" | iconv -f utf8 -t iso8859-1
��

> Opening x.html in openoffice writer I was unable to remove all the
> table etc. stuff and so was unable to reformat the text so it would
> fit on one page. Hmm, it might work, if I copied the text from there
> into a new document. But here I want to solve the locale problems,
> or what should I call the problem?

It is an encoding problem. The webpage does not declare it properly, the
browser screws it up or the copy/paste mechanism gets confused. The
safest approach is to save the page as html or txt (depending on whether
you want to preserve some formatting) and convert it to utf8 with iconv;
some trial-and-error might be necessary to figure out which encoding was
used. (If you absolutely want to rule out screw-ups of your browser then
you have to use wget to download the page.)
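A sketch of that wget route (the URL is a placeholder, I am guessing
iso8859-1 as the source encoding, and the printf line stands in for
the actual download so the example is self-contained):

```shell
# Real usage:  wget -q -O page.html 'http://www.example.com/...'
# Stand-in for the downloaded bytes (ISO-8859-1 "größer"):
printf '<p>gr\xf6\xdfer</p>\n' > page.html
# Convert the raw download, untouched by any browser, to UTF-8:
iconv -f iso8859-1 -t utf8 page.html > page.utf8.html
```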
 
> mc (midnight commander, a norton commander clone) of course goes
> crazy again, but I was not surprised and accepted that it prints 'a'
> with '^' instead of line art, etc. More serious was that when I
> 'ssh'ed to a different computer (not sure which) it got confused
> about which line it was on and I messed up editing /etc/fstab.

Regarding mc: see above. The ssh problem might indicate that not all
environment variables are set properly on the remote machine.
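One thing to check is whether ssh carries the locale over at all; a
sketch, assuming OpenSSH on both ends ("remotehost" is a placeholder):

```shell
# Client side, in ~/.ssh/config or /etc/ssh/ssh_config:
#     SendEnv LANG LC_*
# Server side, in /etc/ssh/sshd_config:
#     AcceptEnv LANG LC_*
# Then compare what the two shells actually see:
#     locale
#     ssh remotehost locale
```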
 
> man gets quote characters wrong, printing 'a' with '^' instead and
> so does gcc.

Can you give an example or do you see this with all manpages?
 
> I also have problems with kvirc. IIRC I can get it to display
> iso8859-1 correctly, but not utf8, and the smart utf8/iso8859-1 mode
> does not work. I chat with users who use iso8859-1 and utf8.

Maybe your KDE does not get the proper locale setting. (It does not read
all the shell initialization files.) Does it help if you start kvirc
from the konsole?
 
> Is there a package which is responsible for all these problems so I
> can file a bug report against it? Or are these bugs in konsole, gcc,
> man, bash, mc, iceweasel, openoffice and kvirc? Or ... is the bug
> sitting in front of the computer again :)?

You may have a screw-up with environment variables somewhere. What is
in your /etc/default/locale configuration file? Do you still have an old
/etc/environment hanging round?
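A quick way to see every place where a stale setting could hide (a
sketch; missing files are harmless, conflicting entries are not):

```shell
# List all LANG/LC_ settings from the usual suspects; files that do
# not exist are silently skipped.
for f in /etc/default/locale /etc/environment ~/.profile ~/.bashrc; do
    echo "== $f =="
    grep 'LC_\|LANG' "$f" 2>/dev/null || true
done
```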

> I wonder if it's easier to set up debian from scratch.

Normally that should not be necessary. When I migrated my laptop to utf8
I simply generated the new *.UTF-8 locales and set LANG and the LC_*
environment variables accordingly. (I never use special characters in
filenames and that saved me a lot of work, of course.)
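If you do end up with latin1 filenames, they can be renamed without
reinstalling; a minimal sketch in plain shell (the convmv package
automates the same thing recursively):

```shell
# Create a file whose name contains a raw ISO-8859-1 byte...
cd "$(mktemp -d)"
touch "$(printf 'gr\xf6\xdfe.txt')"
# ...and rename every such file to the UTF-8 spelling of its name.
for f in *; do
    new=$(printf '%s' "$f" | iconv -f iso8859-1 -t utf8)
    [ "$f" = "$new" ] || mv "$f" "$new"
done
ls    # a UTF-8 terminal now shows größe.txt
```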

[...]

-- 
Regards,            | http://users.icfo.es/Florian.Kulzer
          Florian   |
