Dear Gentoo developers,

I would like to ask what you think about UTF-8 encoded manual pages. I mean files like ls.1.gz, which are used by the venerable "man" program. I recently looked into the problem a little, and before submitting any patches or proposals to Gentoo bugzilla I'd like to hear your opinions first.
Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8", but the original issue is of a more universal nature.

Back to the subject. ISO-8859-* 8-bit encodings are fine and most localized manuals use them. However, there are some cases where UTF-8 manuals are installed as well. Namely, the newest portage uses "linguas_pl" as follows:

  $ emerge -pv portage
  [ebuild R ] sys-apps/portage-2.1_rc3-r3 USE="-build -doc" LINGUAS="pl"

As a result, translated manual pages are added to the system. The problem is that they use UTF-8 encoding. Having both man-pages-pl and this version of portage installed gives unexpected results. In this setup "man ls" displays all the letters with the correct encoding, but "man emerge" does not. On the other hand, if "man" is configured to display UTF-8 encoded manuals correctly, all the other manuals show garbled characters instead of the expected output.

I wrote a simple script [1] which checks all installed Polish manuals using the "file" program. For the "pl" locale it currently produces about 70 kB of text, and for the default locale about 458 kB. After grepping for all occurrences of "UTF" I found that only the newest portage's manuals are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1, Encode::Unicode::UTF7.3pm (but I think those are false positives anyway).

While it's easy to contact the Polish translators of portage's manuals so they can correct them, the underlying problem will have to be solved sooner or later. UTF-8 encoded manuals will probably appear more and more often, and some general decision should be made.

After some discussion on the Polish forum [2] I learnt about groff's deficiencies in UTF-8 handling. However, a wrapper exists [3] that helps somewhat in that matter. But it also requires that all manuals be unified with respect to encoding: *all* ISO-8859-* or *all* UTF-8, no compromise. So I don't know what course to take.

Summing up:
* UTF-8 manuals: good or bad?
* how to handle mixed encodings of manuals?
* should man and/or groff handle UTF-8 better?
* should an eclass function be created to aid in correcting the encoding of manual pages while installing them?

Any constructive comments are more than welcome!

Best regards,
Wiktor Wandachowicz (SirYes)

[1] http://ics.p.lodz.pl/~wiktorw/gentoo/checkman
[2] http://forums.gentoo.org/viewtopic-p-3352287.html
[3] http://hoth.amu.edu.pl/~d_szeluga/groff-utf8.tar.bz2

--
gentoo-dev@gentoo.org mailing list
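
For illustration only, here is a minimal sketch of the kind of per-page encoding check discussed above. It is not the checkman script from [1], and it does not use "file". Instead it assumes that gzip, tr and iconv are available and asks iconv whether the page is valid UTF-8. The function name man_encoding is made up for this example. The "8bit" label only means "not ASCII and not valid UTF-8" (for Polish pages most likely ISO-8859-2, but the sketch does not try to tell which).

```shell
#!/bin/sh
# Hypothetical helper: classify one man page as "ascii", "utf-8" or "8bit".
# Handles both plain and gzip-compressed pages.
man_encoding() {
    page=$1
    case $page in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac

    # Count bytes with the high bit set; zero means the page is pure ASCII.
    high=$(( $($reader "$page" | LC_ALL=C tr -d '\000-\177' | wc -c) ))

    if [ "$high" -eq 0 ]; then
        echo ascii
    elif $reader "$page" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        # iconv accepted every byte sequence, so the page is valid UTF-8.
        echo utf-8
    else
        # Non-ASCII bytes that are not valid UTF-8: some 8-bit encoding.
        echo 8bit
    fi
}

# Example use (the directory is an assumption about where Polish pages live):
#   for f in /usr/share/man/pl/man*/*; do
#       printf '%s\t%s\n' "$(man_encoding "$f")" "$f"
#   done
```

A check like this could also be the first step of an eclass helper: pages reported as "8bit" would be candidates for conversion with iconv, provided the source encoding is known from the language of the page.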