On Thu, Jul 24, 2003 at 04:24:51PM +0200, Michael Piefel wrote:
> On 24.07.03 at 15:55:43, Aaron Isotton wrote:
> > what are man pages, or more generally, groff documents, supposed to be
> > encoded in? I didn't find any reference to that in groff(7). Is it
> > ASCII?
>
> Preferably ASCII, yes. I seem to remember having once read that the
> input actually is in Latin-1. There, 0xA0 is a non-breaking space.
> I wouldn't rely on this, though. Many manpages are in Latin-2, since man
> used to not do any conversion at all, assuming input would be in the
> same encoding as output.

Heh. Well, that's kind of still true under some circumstances. The
following is from src/encodings.c in man-db CVS, and is my attempt to
hit as many of the encoding problems with current groff as I can with a
very large hammer. Feel free to vomit after reading it.

/* Due to historical limitations in groff (which may be removed in the
 * future), there is no mechanism for a man page to specify its own
 * encoding. This means that each national language directory needs to carry
 * with it information about its encoding, and each groff device needs to
 * have a default encoding associated with it. Out of the box, groff
 * formally allows only ISO-8859-1 on input; however, patches originating
 * with Debian and imported by many other GNU/Linux distributions change
 * this somewhat.
 *
 * Eventually, groff will support proper Unicode input, and much of this
 * horror can go away.
 *
 * Do *not* confuse source encoding with groff encoding. The encoding
 * specified in this table is the encoding in which the source man pages in
 * each language directory are expected to be written. The groff encoding is
 * determined by the selected groff device and sometimes also by the user's
 * locale.
 *
 * This table is expected to change over time, particularly as man pages
 * begin to move towards UTF-8. Feel free to patch this for your
 * distribution; send me updates for languages I've missed.
 *
 * Explicit encodings in the directory name (e.g. de_DE.UTF-8) override this
 * table. TODO: Implement this.
 */
static struct {
    const char *lang_dir;
    const char *source_encoding;
} directory_table[] = {
    { "C",     "ISO-8859-1" }, /* English */
    { "POSIX", "ISO-8859-1" }, /* English */
    { "da",    "ISO-8859-1" }, /* Danish */
    { "de",    "ISO-8859-1" }, /* German */
    { "en",    "ISO-8859-1" }, /* English */
    { "es",    "ISO-8859-1" }, /* Spanish */
    { "fi",    "ISO-8859-1" }, /* Finnish */
    { "fr",    "ISO-8859-1" }, /* French */
    { "ga",    "ISO-8859-1" }, /* Irish */
    { "is",    "ISO-8859-1" }, /* Icelandic */
    { "it",    "ISO-8859-1" }, /* Italian */
    { "nl",    "ISO-8859-1" }, /* Dutch */
    { "no",    "ISO-8859-1" }, /* Norwegian */
    { "pt",    "ISO-8859-1" }, /* Portuguese */
    { "sv",    "ISO-8859-1" }, /* Swedish */

#ifdef MULTIBYTE_GROFF
    /* These languages require a patched version of groff with the
     * ascii8 and nippon devices.
     */
    { "cs",    "ISO-8859-2" }, /* Czech */
    { "hu",    "ISO-8859-2" }, /* Hungarian */
    { "ja",    "EUC-JP"     }, /* Japanese */
    { "ko",    "EUC-KR"     }, /* Korean */
    { "pl",    "ISO-8859-2" }, /* Polish */
    { "ru",    "KOI8-R"     }, /* Russian */
#endif /* MULTIBYTE_GROFF */

    { NULL,    NULL         }
};

/* The default groff terminal output device to be used is determined based
 * on nl_langinfo(CODESET), which returns the character set used by the
 * current locale.
 */
static struct {
    const char *locale_charset;
    const char *default_device;
} charset_table[] = {
    { "ANSI_X3.4-1968", "ascii"  },
    { "ISO-8859-1",     "latin1" },
    { "UTF-8",          "utf8"   },

#ifdef MULTIBYTE_GROFF
    { "EUC-JP",         "nippon" },
#endif /* MULTIBYTE_GROFF */

    { NULL,             NULL     }
};

static const char *fallback_locale_charset = "ANSI_X3.4-1968";
static const char *fallback_default_device =
#ifdef MULTIBYTE_GROFF
    "ascii8"
#else /* !MULTIBYTE_GROFF */
    "ascii"
#endif /* MULTIBYTE_GROFF */
    ;

/* The encoding used for the text passed to groff is a function of the
 * selected groff device. Traditional devices expect ISO-8859-1 on input
 * (yes, even the utf8 device); devices added in the Debian multibyte patch
 * expect other encodings. The ascii8 device passes top-bit-set characters
 * straight through so is (probably ...) encoding-agnostic. If this encoding
 * does not match the source encoding, an iconv pipe is used (if available)
 * to perform recoding.
 *
 * Setting less_charset to latin1 tells the less pager that characters
 * between 0xA0 and 0xFF are displayable, not that its input is encoded in
 * ISO-8859-1. TODO: Perhaps using LESSCHARDEF would be better.
 */
static struct {
    const char *roff_device;
    const char *roff_encoding;
    const char *less_charset;
} device_table[] = {
    { "ascii",  "ISO-8859-1", "ascii"  },
    { "latin1", "ISO-8859-1", "latin1" },
    { "utf8",   "ISO-8859-1", "utf-8"  },

#ifdef MULTIBYTE_GROFF
    { "ascii8", NULL,         "latin1" },
    { "nippon", "EUC-JP",     "ja"     },
#endif /* MULTIBYTE_GROFF */

    { NULL,     NULL,         NULL     }
};

static const char *fallback_roff_encoding = "ISO-8859-1";
static const char *fallback_less_charset = "latin1";

-- 
Colin Watson [EMAIL PROTECTED]
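
The comments above describe a two-step decision: look up the input encoding that the selected groff device expects, then compare it with the page's source encoding to see whether an iconv pipe is needed. A minimal sketch of that decision follows. It is not the man-db code: the helper names roff_encoding_for_device and needs_recoding are hypothetical, the table is a trimmed copy of the device_table rows quoted above (without the less_charset column), and treating a NULL roff_encoding as "pass the text through unchanged" is an assumption based on the ascii8 remark.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed copy of the quoted device_table; less_charset is omitted. */
struct device_entry {
    const char *roff_device;
    const char *roff_encoding; /* NULL: assumed to mean encoding-agnostic */
};

static const struct device_entry device_table[] = {
    { "ascii",  "ISO-8859-1" },
    { "latin1", "ISO-8859-1" },
    { "utf8",   "ISO-8859-1" },
    { "ascii8", NULL         },
    { "nippon", "EUC-JP"     },
    { NULL,     NULL         }
};

static const char *fallback_roff_encoding = "ISO-8859-1";

/* Hypothetical helper: return the input encoding the given groff device
 * expects, NULL for an encoding-agnostic device, or the fallback encoding
 * if the device is not in the table. */
static const char *roff_encoding_for_device(const char *device)
{
    const struct device_entry *e;

    assert(device != NULL);
    for (e = device_table; e->roff_device != NULL; e++)
        if (strcmp(e->roff_device, device) == 0)
            return e->roff_encoding;
    return fallback_roff_encoding;
}

/* Hypothetical helper: return 1 if a page written in source_encoding would
 * have to be recoded before being fed to groff for this device, else 0. */
static int needs_recoding(const char *source_encoding, const char *device)
{
    const char *roff_encoding = roff_encoding_for_device(device);

    assert(source_encoding != NULL);
    if (roff_encoding == NULL)
        return 0; /* device passes bytes straight through */
    return strcmp(source_encoding, roff_encoding) != 0;
}
```

With these tables, needs_recoding("ISO-8859-2", "latin1") reports that a Polish page would go through iconv, while the ascii8 device never triggers recoding whatever the source encoding is.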