Pablo Saratxaga wrote on 2001-05-03 12:20 UTC:
> > The combination of "man" (version 1.5h) and "groff" (GNU troff version
> > 1.16.1) is seriously broken in a UTF-8 locale. Even for ASCII-only man
> > pages, groff inserts Latin-1 SHY bytes, which result in an ugly
> > malformed UTF-8 sequence. It is very disappointing that this doesn't
> > work correctly out-of-the-box, because the underlying groff mechanics
> > for UTF-8 output is already in place and seems to work correctly:
> 
> The problem is primarily that the sources of the man pages are not in UTF-8.

No, this is not the problem at all! Running groff -Tutf8 on a non-UTF-8
file already produces perfectly good UTF-8 output from ASCII man pages.

First of all, I want ASCII English man pages to display correctly in
UTF-8 mode. Even these break at the moment. It is far too early to worry
about non-English man pages.
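
To illustrate why a Latin-1 SHY byte is a problem in UTF-8 output, here is
a minimal Python sketch. The sample word and the helper name is_valid_utf8
are made up for illustration; only the byte values are facts about the two
encodings.

```python
# SOFT HYPHEN is U+00AD. In Latin-1 it is the single byte 0xAD; in UTF-8 it
# must be encoded as the two bytes 0xC2 0xAD.
SHY = "\u00ad"

latin1_bytes = ("hy" + SHY + "phen").encode("latin-1")
utf8_bytes = ("hy" + SHY + "phen").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes, is_valid_utf8(latin1_bytes))
    print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

A lone 0xAD byte is a UTF-8 continuation byte with no lead byte in front of
it, which is why a UTF-8 terminal or pager treats it as malformed.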

> That is, the man page viewers have to be modified in order to be able
> to convert encodings.

At the moment, you get garbage with English ASCII. EUC-JP support for
groff is a completely different topic, because PostScript doesn't
support Japanese without additional fonts, etc. Let's first of all get
the normal standard PostScript repertoire supported from English ASCII
groff input.


> That is a client side problem.
> 
> >   zcat /usr/share/man/man7/groff_char.7.gz | groff -mandoc -Tutf8 - | less
> > 
> > produces the desired results, whereas
> > 
> >   man groff_char
> > 
> > does not.
> 
> man has to get patched.

I think only /etc/man.config and groff have to be patched here. "man"
should not have to worry about the locale, it is just a simple wrapper
to find files and pass them through groff to less. It should never touch
the character encoding.

The main problem is that groff lacks a simple command-line option to
produce locale-encoded plaintext output that does the right thing
whether you are in an ASCII, Latin-1, or UTF-8 locale.
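
Until groff has such an option, the selection could be approximated
outside groff. The following Python sketch is only an illustration: the
function names groff_device and nroff_command are hypothetical, and falling
back to -Tascii for every codeset other than UTF-8 and ISO-8859-1 is my
assumption, not existing groff or man behaviour.

```python
import locale


def groff_device(codeset: str) -> str:
    """Map a locale codeset name to a groff output device name."""
    name = codeset.upper().replace("_", "-")
    if name in ("UTF-8", "UTF8"):
        return "utf8"
    if name in ("ISO-8859-1", "ISO8859-1", "LATIN1"):
        return "latin1"
    # Assumption: anything else gets plain ASCII output.
    return "ascii"


def nroff_command() -> list:
    """Build a groff command line for the current locale (Unix only)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unusable locale settings: stay in the "C" locale
    codeset = locale.nl_langinfo(locale.CODESET)
    return ["groff", "-T" + groff_device(codeset), "-mandoc"]


if __name__ == "__main__":
    print(" ".join(nroff_command()))
```

This only chooses among the existing -Tascii, -Tlatin1 and -Tutf8 devices;
it does not make groff itself locale-aware.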

> But does using groff -Tutf8 on a non-UTF-8 file convert it to UTF-8?

Sure. It inserts lots of UTF-8 symbols such as soft hyphens, directional
quotation marks, etc. Try

  zcat /usr/share/man/man7/groff_char.7.gz | groff -mandoc -Tutf8 - | less

to see the entire repertoire of characters that groff supports in
PostScript output on your terminal!

> In other words, the -T parameter tells the encoding of the file or the
> encoding to use in the output?

-T determines the output. The input is always in ASCII (just like in TeX).

> > The required fix here is that groff should get a new output device
> > -Tplaintext which specifies plaintext encoded according to the current
> > locale (just query nl_langinfo(CODESET) and see whether it says "UTF-8"
> > or "ISO-8859-*" or something like that). Then in /etc/man.config, we
> > could simply replace
> > 
> >   NROFF           /usr/bin/groff -Tlatin1 -mandoc
> > 
> > with
> > 
> >   NROFF           /usr/bin/groff -Tplaintext -mandoc
> > 
> > and man would automatically work properly in both ISO-8859 and UTF-8
> > locales.
> 
> Have you tested that idea with Russian or Japanese man pages?
> Eg: man pages in koi8-r displayed under an UTF-8 locale.

The standard PostScript fonts do not support Russian or Japanese, so
what is the point? Please remember that groff is primarily a tool to
produce formatted PostScript output. The text output used by man pages
is just a spin-off that is at the moment not really suited for any
languages not covered by the PostScript standard encoding.

Sure, there are hackers who write man pages in other 8-bit encodings or
even EUC, and groff pipes it with moderate pain through its formatting
routines, ignorant of the nature of the characters it handles. But that
hack only works for the plaintext output, not generically for PostScript
or proper HTML. Groff (and thus man) currently does not support non-Latin
scripts, and existing Russian and Japanese man pages are mostly an
unprintable illusion.

> I'm afraid your solution will work only for plain ascii pages.

I'm afraid groff is currently really not designed for anything else.
This might change in the future if we get a completely redesigned groff
that uses UCS internally, but that is science fiction so far. Don't read
anything into the existence of hacked KOI8-R man pages. There are no
Cyrillic characters listed in man groff_char!

> perl -e 'use utf8; print "\x{20ac}\b\x{20ac}\x{2203}\b___\n"' | less
> 
> works.

Which is definitely not how it should work. less has to understand that
on a terminal \b moves back one character, not one byte.
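
The difference can be sketched in Python. Both functions below are
hypothetical simplifications that model "X\bX" overstriking as "backspace
deletes the previous unit"; this is not how less is implemented.

```python
def strip_overstrike_chars(text: str) -> str:
    """Resolve backspaces per character: \\b removes one whole character."""
    out = []
    for ch in text:
        if ch == "\b":
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


def strip_overstrike_bytes(data: bytes) -> bytes:
    """Resolve backspaces per byte: \\b removes only the last byte."""
    out = bytearray()
    for b in data:
        if b == 0x08:
            if out:
                out.pop()
        else:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    sample = "\u20ac\b\u20ac"  # EURO SIGN overstruck with itself
    print(repr(strip_overstrike_chars(sample)))
    print(strip_overstrike_bytes(sample.encode("utf-8")))
```

The per-byte version leaves the first two bytes of the three-byte EURO SIGN
sequence behind, so its result is no longer valid UTF-8.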

> > Summary: Red Hat 7.1 is not even suited to make a 5 min demonstration of
> > its UTF-8 locale support without serious embarrassment. xterm is pretty
> > much the only UTF-8 application that works at the moment.
> 
> No, there are several other applications; the most annoying thing is the
> broken fontset support at the XFree86 level. If that were fixed, then all
> programs using fontsets (e.g. all of Gnome, Windowmaker, etc.) would
> automatically start displaying nice Unicode out of the box.

Has this been identified and fixed in the XFree86 4.1 snapshot?

> There is also another bug, in 'ls'.
> Try a 'touch somefile' with a UTF-8 name, then do an ls; you will see
> only '?'.

I can't reproduce that problem under RH 7.1. "ls" seems to work
correctly. I produced a number of files with (normal-width) non-ASCII
characters, and their display and column arrangement looks very nice.
Are you sure you have successfully selected an existing UTF-8 locale,
i.e. that setlocale() did not return an error?
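
A quick way to check this is to ask the C library directly. This is a
minimal Python sketch; the helper name locale_available is made up, and
which locale names actually exist depends on the installation.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if setlocale() accepts the given LC_CTYPE locale name."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Python raises locale.Error when the C setlocale() call fails.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore previous state


if __name__ == "__main__":
    for candidate in ("C", "en_GB.UTF-8", "de_DE.UTF-8"):
        print(candidate, locale_available(candidate))
```

If the locale named in LANG or LC_ALL is reported as unavailable, programs
silently stay in the "C" locale, and non-ASCII bytes showing up as '?' would
be the expected result.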

Perhaps we should produce a reference tar file with a large directory
structure consisting of UTF-8 names for such testing.
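
Such a tar file could be generated rather than maintained by hand. This is
a rough Python sketch; the sample names and the function names are invented
for illustration, and a real test set would need far more entries (wide
characters, combining characters, long names).

```python
import io
import tarfile

# Hypothetical sample names; members are empty regular files.
SAMPLE_NAMES = [
    "utf8-test/\u00e4\u00f6\u00fc.txt",
    "utf8-test/\u20acuro/\u0444\u0430\u0439\u043b.txt",
    "utf8-test/\u65e5\u672c\u8a9e/\u30c6\u30b9\u30c8.txt",
]


def build_reference_tar(names) -> bytes:
    """Build an in-memory tar archive with one empty file per name."""
    buf = io.BytesIO()
    # The pax format stores non-ASCII member names as UTF-8.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in names:
            info = tarfile.TarInfo(name)  # regular file by default
            info.size = 0
            tar.addfile(info)
    return buf.getvalue()


def list_names(data: bytes) -> list:
    """Read the archive back and return the member names."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        return tar.getnames()


if __name__ == "__main__":
    archive = build_reference_tar(SAMPLE_NAMES)
    with open("utf8-test.tar", "wb") as f:
        f.write(archive)
    print(list_names(archive))
```

Building the archive in memory avoids depending on the file name encoding
of the machine that creates it.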

When I just tried to enter

  touch äääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääääää

to test ls, it also became apparent that in RH 7.1 bash (and readline?)
breaks in a UTF-8 locale.

> Is it OK to post patches on this mailing list?

Sure. (And probably also to [EMAIL PROTECTED])

Does your patch also handle bi-width (single- and double-width) output
correctly?
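
What I mean by bi-width can be sketched as follows. This Python function
is only an approximation based on the Unicode East Asian Width property; it
is not the C library's wcwidth(), and the name display_width is made up.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate how many terminal columns text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


if __name__ == "__main__":
    for name in ("abc", "\u00e4\u00e4\u00e4", "\u65e5\u672c\u8a9e"):
        print(name, display_width(name))
```

A column layout like that of ls has to pad by this width, not by the number
of bytes or characters.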

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
