[ Colin CCed for some input on groff vs minus situation. ]

On Thu, 2016-10-27 at 17:10:59 -0700, Russ Allbery wrote:
> Guillem Jover <guil...@debian.org> writes:
> > For the current conversion in dpkg, I've taken most of the common
> > symbols from groff_char(7) and created a very simple sed script, I'm not
> > sure if you were thinking about something along those lines (although in
> > proper perl)?
> >
> > <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/tree/man/utf8toman.sed?h=next/master&id=c07b9b79447e200645ea423f959194fcbf8d4d32>
>
> Yeah, that would work, although aren't there quite a few more sequences
> than that? Does groff have a way of representing an arbitrary Unicode
> code point?

Ah right, indeed it does. And it's explained in that same man page I
referred to. O:) The escape sequence would be something like \[u0021] or
\[u0041_0300].

> For Pod::Man usage, the output format I'd want would be a hash mapping
> Unicode code points to the correct groff escape. Or, in an absolutely
> ideal world, to have an Encode encoding for groff escapes, similar to how
> the Encode::MIME::Header encoding works to generate RFC 2047 strings.

I happened to stumble over an old patch by Brendan O'Dea that might be
helpful, including a reference here to not lose track of that:

  <https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=442066;filename=groff-utf8;msg=22>

> > If you could specify exactly which symbols you'd like to see supported I
> > might take a stab at this, when I have some spare time. Say everything
> > in groff_char(7) or similar. :)
>
> As much as possible is of course ideal, but I'm happy to take partial
> work! :)

Ok! :)

> > The other major issue are commands, which I'm not sure are so easy to
> > detect. Maybe they could get to use the \- minus if they are inside some
> > other markup. I see that C<some-command> escapes them, as does
> > L<some-command(1)>, but L<some-command> does not (any reason?), which
> > could be handy to use I guess. Filenames are also safe with
> > F</some-dir/file-name>. The only problem is using the proper markup that
> > also preserves the same output as the current man pages.
>
> B<> and I<> could just be surrounding normal words that should use normal
> hyphens. L<some-command> is a link to a section in the same document
> entitled some-command, so the assumption there is also that it could be a
> regular English word.

Oh, at least perlpod(1) says that L<name> links to a Perl manual page, so
I'd expect it to be equivalent to the L<crontab(5)> style when processing
minus chars, and L</sec> does the inter-section linking?
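
For illustration only, here is a rough sketch of the \[uXXXX] form
discussed earlier in this mail, written in Python rather than the Perl
that Pod::Man would actually need. The function name groff_escape is made
up for this example. It assumes groff wants the canonically decomposed
(NFD) form with uppercase hex digits joined by "_", as in the
\[u0041_0300] example, and it passes plain ASCII through unchanged. It
does not attempt to handle backslashes, hyphens, or the named glyphs from
groff_char(7).

```python
import unicodedata


def groff_escape(text: str) -> str:
    """Hypothetical helper: replace non-ASCII text with \\[uXXXX] escapes."""
    # Decompose first, so that e.g. U+00C0 becomes U+0041 U+0300.
    # Assumption: groff expects composite glyphs in decomposed form.
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            # A combining mark attaches to the preceding base character.
            clusters[-1].append(ch)
        else:
            clusters.append([ch])

    out = []
    for cluster in clusters:
        if len(cluster) == 1 and ord(cluster[0]) < 0x80:
            # Lone ASCII characters are passed through untouched.
            out.append(cluster[0])
        else:
            # At least four uppercase hex digits per code point, joined by "_".
            hexes = "_".join(f"{ord(c):04X}" for c in cluster)
            out.append("\\[u" + hexes + "]")
    return "".join(out)
```

With this sketch, groff_escape("\u00c0") gives \[u0041_0300], matching the
second example above. A hash keyed by code point, as Russ describes, could
be built by calling such a function once per character of interest.
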

> As you say, though, I'm not entirely sure the distinction is worth all the
> trouble we've put into it over the years. nroff at least seems to have
> just given up and maps them all to "-" in the output anyway. That used to
> be a Debian-specific change, but it looks like upstream has switched to
> treating - as \-, I think? For HTML output, upstream maps \- to −
> and Debian still overrides that to - instead. (If upstream thinks \- is a
> minus sign and not ASCII 45, I'm really confused what's going on with
> this, though.)

We should probably ask Colin about this. :)

> > I've always found the AUTHORS, COPYRIGHT or LICENSE sections to be
> > distracting, and in dpkg we got rid of all of them, because in addition
> > they were getting usually out-of-sync with the actual copyright
> > statements, and required adding names and updating years in two places.
>
> Yeah, that part is irritating. The alternative, which I use in my
> packages these days, is to have these reflect the authors, copyright, and
> license of the *manual page*, but that's also weird.

Right, that's what dpkg used to have. But even then I've still found this
distracting.

> =for license, resulting in a comment in the generated man page, seems like
> a better general solution (and then it probably makes sense for this to
> always reflect the license of the documentation file itself, not the
> larger package).

Yeah.

Thanks,
Guillem