Re: UTF-8 out-of-the box experience

Markus Kuhn Thu, 03 May 2001 12:03:28 -0700
Pablo Saratxaga wrote on 2001-05-03 18:29 UTC:
> > No, this is not at all the problem! Groff -Tutf8 on a non-utf8 file
> > produces already perfectly nice UTF-8 files from ASCII man pages.
> 
> For ascii yes :) as it happens to be invariant in this case.

No, it is not invariant. English ASCII text is formatted using lots of
lovely non-ASCII Unicode characters with -Tutf8 to get the best possible
UTF-8 approximation of the original Postscript output.

> I thought about french, russian, etc pages.

French is no problem. It is fully supported by groff and the standard
PostScript encoding. You write \('e for é, \(oe for ligature oe, etc. in
French groff input and this is translated into the correct PostScript or
Latin1 or UTF-8 output. There is *no* way to write Cyrillic in groff at
the moment. What some people unfortunately use is the silly hack of
blindly passing through 8-bit characters and hoping that the receiving
text terminal uses the same encoding, which works only for plain text
output anyway but has nothing to do with proper formatted output.
Russian typesetting is impossible with groff at the moment.

> I didn't mean postscript (-Tps), but simply that plain tty output works
> for other thing than ascii.

That's not the primary function of groff. A proper manpage must be
proofread in PostScript in the end, as that is how it gets into the
manual. It is not enough that it can be previewed on a text terminal.

> Well, I wonder. It seems nowadays it is used (at least for non English text)
> primarly for man page online formatting.

That's perhaps what you do. I still print out many man pages for myself
and my students. When I write man pages, I test whether they look ok if
printed with Postscript, because that is the original form of the man
pages. It is the same classic layout that the Unix manuals from
commercial vendors always came in, which were printed with troff. The
X11 documentation is printed as PostScript with troff. You can't claim
that you have an i18n version of groff before PostScript output also
works for all your target languages. Nroff is really just a quick&dirty
on-screen preview tool, nothing more.

> > > perl -e 'use utf8; print "\x{20ac}\b\x{20ac}\x{2203}\b___\n"' | less
> > > 
> > > works.
> > 
> > Which is definitely not how it should work. Less has to understand that
> > \b moves back on a terminal one character, not one byte.
> 
> The problem is not with \b, that bit worked with the patch I had.
> The problem is the 'underline' property only applied to the first *byte*
> of the char to be underlined.

You want to underline all three bytes of the UTF-8 character separately? :-)
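
The character-versus-byte point can be sketched in a few lines of
Python. This is only an illustration of the principle, not how less is
implemented; parse_overstrike() and the style names are made up for the
example. The input is decoded from UTF-8 first, so that \b and the
bold/underline attribute always apply to whole characters.

```python
def parse_overstrike(raw):
    """Turn nroff-style overstrike bytes into (character, style) pairs.

    Assumed conventions: "_\\bX" or "X\\b_" underlines X, and "X\\bX"
    makes X bold. Decoding happens before any \\b handling, so a
    backspace always refers to one character, never to one byte.
    """
    text = raw.decode("utf-8")
    cells = []
    i = 0
    while i < len(text):
        # Look for the pattern: character, backspace, character.
        if i + 2 < len(text) and text[i + 1] == "\b":
            first, second = text[i], text[i + 2]
            if first == "_":
                cells.append((second, "underline"))
            elif second == "_":
                cells.append((first, "underline"))
            elif first == second:
                cells.append((first, "bold"))
            else:
                # Unknown combination: keep the character printed last.
                cells.append((second, "plain"))
            i += 3
        else:
            cells.append((text[i], "plain"))
            i += 1
    return cells


if __name__ == "__main__":
    # Same string as in the quoted perl one-liner, encoded as UTF-8.
    data = "\u20ac\b\u20ac\u2203\b___\n".encode("utf-8")
    for char, style in parse_overstrike(data):
        print(repr(char), style)
```

With this approach the euro sign comes out as one bold cell and the
"there exists" sign as one underlined cell, even though each of them
occupies three bytes in the input.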

[ls bug]
> It apparently is fixed in fileutils 4.1;

RH 7.1 comes with "ls (GNU fileutils) 4.0.36" and that works nicely for
me so far in UTF-8 locales.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/