Dear Gentoo developers,

I would like to ask what you think about UTF-8 encoded manual pages,
that is, files like ls.1.gz, which are read by the venerable "man" program.
I have recently looked into the problem a little, and before submitting any
patches or proposals to Gentoo bugzilla I'd like to hear your opinions.

Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
but the original issue is of a more universal nature.

Back to the subject. ISO-8859-* 8-bit encodings work fine and most
localized manuals use them. However, in some cases UTF-8 manuals are
installed as well. For instance, the newest portage uses "linguas_pl":

$ emerge -pv portage
[ebuild   R   ] sys-apps/portage-2.1_rc3-r3  USE="-build -doc" LINGUAS="pl"

As a result, translated manual pages are added to the system. The problem
is that they use UTF-8 encoding. Having both man-pages-pl and this version
of portage installed gives inconsistent results: "man ls" prints all
the letters correctly, but "man emerge" does not. On the other
hand, if "man" is configured to display UTF-8 encoded manuals correctly,
all the other manuals show garbled characters instead of the desired output.

I wrote a simple script [1] which checks all installed Polish manuals
using the "file" program. For the "pl" locale it currently produces about
70 kB of text, and for the default locale about 458 kB. After grepping for
all occurrences of "UTF" I found that only the newest portage's manuals
are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
Encode::Unicode::UTF7.3pm (but I think the latter are false positives).
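
For illustration, here is a minimal sketch of that kind of check. It is
not the script from [1]: instead of calling "file" it simply tries to
decode each page, and the names classify(), read_page() and scan() are
made up for this example. It also assumes that pages are either plain or
gzip-compressed; other compressors are not handled.

```python
import gzip
from pathlib import Path


def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or '8-bit' for the raw contents of a page."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so most likely a legacy 8-bit encoding
        # such as ISO-8859-2. The exact charset cannot be told from here.
        return "8-bit"


def read_page(path: Path) -> bytes:
    """Read a manual page, transparently unpacking *.gz files."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as fh:
        return fh.read()


def scan(man_root: Path):
    """Yield (path, kind) for every regular file below man_root."""
    for path in sorted(man_root.rglob("*")):
        if path.is_file():
            yield path, classify(read_page(path))
```

Calling scan(Path("/usr/share/man/pl")) and keeping only the entries whose
kind is not "ascii" would list the pages that matter here. Pure ASCII pages
are harmless under either setup.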

While it's easy to contact the Polish translators of portage's manuals so
that they can convert them, the underlying problem will have to be solved
sooner or later. UTF-8 encoded manuals will probably become more common,
and some general solution should be agreed on.

After some discussion on the Polish forum [2] I learned about groff's
deficiencies in UTF-8 handling. A wrapper exists [3] that helps
somewhat in that matter, but it also requires that all manuals share one
encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
So I don't know what course to take.

Summing up:
* UTF-8 manuals: good or bad?
* how to handle mixed encodings of manuals?
* should man and/or groff handle UTF-8 better?
* should an eclass function be created to aid in correcting the encoding
  of manual pages while installing them?
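
To make the last point more concrete, the conversion step of such a helper
could look roughly like the sketch below. This is only the logic, written
in Python for readability; a real eclass function would be bash and does
not exist yet. The name to_legacy() and the default target ISO-8859-2 are
my assumptions for the Polish case.

```python
def to_legacy(data: bytes, target: str = "iso-8859-2") -> bytes:
    """Convert a UTF-8 manual page to an 8-bit encoding.

    Pages that are not valid UTF-8 are returned unchanged, on the
    assumption that they already use the legacy encoding.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return data
    # Raises UnicodeEncodeError if a character has no equivalent in
    # the target charset, so nothing is silently lost.
    return text.encode(target)
```

Pure ASCII pages pass through byte for byte, so running this over every
installed page would be safe. The opposite direction (everything to UTF-8)
is harder, because the source charset of an 8-bit page has to be known.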

Any constructive comments are more than welcome!

Best regards,
Wiktor Wandachowicz
(SirYes)

[1] http://ics.p.lodz.pl/~wiktorw/gentoo/checkman
[2] http://forums.gentoo.org/viewtopic-p-3352287.html
[3] http://hoth.amu.edu.pl/~d_szeluga/groff-utf8.tar.bz2


-- 
gentoo-dev@gentoo.org mailing list
