Guillem Jover <guil...@debian.org> writes:

> Using raw UTF-8 in the roff source is not portable, and some (most?)
> implementations might not be happy about that. But using the escape
> sequences should always be safe(?). (I've just verified at least on AIX
> and Mac OS X systems.)

Internationalization of man pages has a bunch of irritating problems that
come down to picking which non-portable problem you want to have.  groff
macros are portable across groff versions with varying levels of maturity
in their Unicode handling... but not to *roff implementations other than
groff, as most of the macros used for Unicode characters seem to be groff
inventions not present in traditional UNIX *roff implementations.

Using Unicode directly in the *roff source is probably more portable
these days, since groff seems to handle it acceptably and I suspect more
*roff implementations handle that than handle groff-specific escapes.
But I no longer have access to a wide variety of traditional UNIX
platforms to check.

I know that eight-bit characters in *roff source caused serious problems
(segfaults, etc.) on very old *roff implementations on proprietary UNIXes
(Solaris 2.4, that sort of thing), which is why pod2man has never emitted
them unless given a special flag (-u).  But I'm not sure it makes sense
to still be that cautious, and the default output of pod2man is awful
(replacing all non-ASCII characters with X, which just isn't acceptable
any more).

Various people have asked for a groff macro output mode, and I think that
would be a fine idea, except that it requires some effort to build the
large table of Unicode code point to groff macro mappings.  I'm not sure
whether that or raw Unicode should be the default output mode (I want to
get rid of the current default).  From your portability investigation, it
sounds like using groff macros as the default output mode might work,
which is valuable information!

Needless to say, if anyone wanted to put together the mapping table to
enable that, I would be very interested.  I'll add it to my personal
to-do list, but that's quite long and the time I have available to work
on free software at the moment is sadly limited.
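
To give a sense of what such a mapping table could look like, here is a
minimal sketch.  It is not taken from podlators: %GROFF_ESCAPE and
to_groff_escapes are hypothetical names, the table is a tiny hand-picked
subset of the glyph names documented in groff_char(7), and the \[uXXXX]
fallback is groff-specific, so other *roff implementations may well not
accept it.

```perl
use strict;
use warnings;

# Hypothetical, hand-picked subset of a code point to groff escape table.
# A real table would need to cover far more of groff_char(7).
my %GROFF_ESCAPE = (
    0x2018 => '\\(oq',    # left single quotation mark
    0x2019 => '\\(cq',    # right single quotation mark
    0x201C => '\\(lq',    # left double quotation mark
    0x201D => '\\(rq',    # right double quotation mark
    0x2013 => '\\(en',    # en dash
    0x2014 => '\\(em',    # em dash
    0x00A9 => '\\(co',    # copyright sign
);

# Replace every non-ASCII character with a named groff escape if the table
# has one, and with groff's generic \[uXXXX] form otherwise (assumption:
# the consumer is groff, since that form is a groff extension).
sub to_groff_escapes {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord($1);
        $GROFF_ESCAPE{$cp} // sprintf('\\[u%04X]', $cp);
    }ge;
    return $text;
}

print to_groff_escapes("\x{201C}na\x{EF}ve\x{201D}"), "\n";
```

The output stays pure ASCII, which sidesteps the eight-bit input problem
on old implementations, at the cost of depending on groff's escape names.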

> But coming back to the source code, yes, I pretty much agree that roff
> can be very noisy and non-readable, to the point I've actually gotten
> bothered enough to check for possible alternatives this last month. The
> problem is finding a format that is clear, expressive enough, supported
> by po4a, does not require huge Build-Depends and produces portable and
> nicely formatted man pages. The obvious candidate is perl's POD, because
> we are already using that for the perl modules and require perl to
> build.

> But I've found some quirks and issues that while not unsurmountable,
> might need to be looked at first and perhaps fixed or workarounds found
> to avoid "regressions", and I'm not sure which ones Russ would be happy
> to get bug reports for? :)

I'm definitely happy to get bug reports!  I do try to slowly work through
issues like this (for instance, I've now added separate flags to control
the left and right quote marks, from a bug report you filed quite some
time ago).  Obviously, patches make things even faster, and I'm slowly
trying to modernize and improve the coding style of the podlators code,
although it's a rather long process.

> I'm attaching a PoC conversion (can be tested with «pod2man
> deb-symbols.pod|man -l -», and is available also from [G]) and here's a
> list of potential differences/issues:

> - References are in italic not bold.

I can change this (a bug report to remind me to do so is very welcome).
For the record, italics actually used to be the correct convention
somewhere (I know I didn't make that up), probably Solaris since I took a
lot of the conventions from there, but I see that man-pages(7) now
recommends bold.  This is one of those things that was never
standardized, but at this point I think the Linux man-pages Project is
sufficiently widespread and authoritative that, as long as it's not in
complete disagreement with BSD, I'm happy to go with their conventions.
Particularly over old Solaris conventions, since Solaris is now mostly
dead.

> - Does not map ‘’, “”, and other UTF-8 quotes to roff escape sequences
>   (or have to use non-portable --utf8 option).

See above for a rather extended discussion of that.

> - Needs raw roff for some formatting, as POD is not expressive enough
>   (this will have to do with «=begin man» as pod2man cannot change
>   the POD syntax anyway).

Yes.  POD is sadly a somewhat limited syntax, and while there was a Perl
6 take on POD that was trying to expand it, I don't think it ever caught
on.  These days, everyone seems to have switched to Markdown or
reStructuredText, which certainly have their merits but which don't seem
to be good fits for man page generation.  So, for things like tables,
you're probably going to need to continue to escape to raw *roff with
=begin man.

> - Many minus signs are output as hyphens (for example for field names).

This is a nasty problem, since POD has no explicit markup for this and
one has to use heuristics.  Improvements in the heuristics are certainly
welcome.  This is the current code:

    # By the time we reach this point, all hyphens will be escaped by adding a
    # backslash.  We want to undo that escaping if they're part of regular
    # words and there's only a single dash, since that's a real hyphen that
    # *roff gets to consider a possible break point.  Make sure that a dash
    # after the first character of a word stays non-breaking, however.
    #
    # Note that this is not user-controllable; we pretty much have to do this
    # transformation or *roff will mangle the output in unacceptable ways.
    s{
        ( (?:\G|^|\s) [\(\"]* [a-zA-Z] ) ( \\- )?
        ( (?: [a-zA-Z\']+ \\-)+ )
        ( [a-zA-Z\']+ ) (?= [\)\".?!,;:]* (?:\s|\Z|\\\ ) )
        \b
    } {
        my ($prefix, $hyphen, $main, $suffix) = ($1, $2, $3, $4);
        $hyphen ||= '';
        $main =~ s/\\-/-/g;
        $prefix . $hyphen . $main . $suffix;
    }egx;

As you can see, it's a bunch of messy and rather fragile regexes.
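
For illustration, the substitution above can be exercised on its own.
This is a sketch rather than how Pod::Man is actually organized: the
function name unescape_word_hyphens is hypothetical, and the blanket
s/-/\\-/g at the top is only an assumed stand-in for the escaping that
happens earlier in the real code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub unescape_word_hyphens {
    my ($text) = @_;

    # Assumed stand-in for the earlier pass that escapes every hyphen.
    $text =~ s/-/\\-/g;

    # The heuristic itself, copied from the code quoted above.
    $text =~ s{
        ( (?:\G|^|\s) [\(\"]* [a-zA-Z] ) ( \\- )?
        ( (?: [a-zA-Z\']+ \\-)+ )
        ( [a-zA-Z\']+ ) (?= [\)\".?!,;:]* (?:\s|\Z|\\\ ) )
        \b
    } {
        my ($prefix, $hyphen, $main, $suffix) = ($1, $2, $3, $4);
        $hyphen ||= '';
        $main =~ s/\\-/-/g;
        $prefix . $hyphen . $main . $suffix;
    }egx;
    return $text;
}

for my $sample (
    'a well-known option',        # ordinary hyphenated English word
    'the Build-Depends field',    # field name that looks like a word
    'pass --utf8 here',           # option with leading dashes
) {
    print unescape_word_hyphens($sample), "\n";
}
```

The first two samples come back with a plain hyphen, while the option
keeps its escaped dashes.  The second sample is the reported problem in
miniature: to the regex, a field name is indistinguishable from a
hyphenated word.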
But there is a test system, so I'm happy to tweak these and add more
tests if you have specific use cases that you encounter.  The trick is
going to be distinguishing between hyphenated English words (which should
use the unmarked - character in *roff source) and field names where you
want an explicit \- minus sign.  That said, I could see an argument for
simply supporting a way to disable this heuristic for people who don't
care about good line wrapping.

> - Default for pod2man is no justified text.

This is a (very strong) personal preference, since I think most man pages
are read on terminals with fixed-width fonts, and I think justified text
looks awful in a fixed-width font.  But I'd be happy to add a non-default
flag that leaves justification turned on.

> - The license blurb is only present as a comment on the source.

Yeah, I've given up on this and just put the license in a section of the
output of the man page, but I think it would be lovely to put it in a
comment.  This probably requires some sort of =for license block (and I'm
not sure what Pod::Text should do with it -- just suppress it entirely, I
guess).  I'm happy to add support for this (obviously, patches even more
welcome).

--
Russ Allbery (r...@debian.org) <http://www.eyrie.org/~eagle/>