tags 584437 + patch pending
retitle 584437 roffit: [PATCH] better man page header detection (.TH tag)
forwarded 584437 <dan...@haxx.se>
thanks

markus schnalke <mei...@marmaro.de> writes:

> .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
> .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

From an unscientific test:

    $ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '\"'
    /usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
    /usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
    /usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
    /usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
    /usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
    /usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
    /usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
    /usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
    /usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
    /usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
    /usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
    /usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
    /usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
    /usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
    /usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
    /usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
    /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
    /usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
    /usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
    /usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
    /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream

There don't seem to be any cases where "\ " is used, apart from the
RCS $Date$ keyword in saidar.1.

I'm inclined to conclude that bug reports should be sent to the packages
whose pages use backslashes in the .TH line. Those pages should be
converted to the double-quote notation.

The main problem with those pages:

- No information can be parsed reliably; there are no delimiters
  (start, stop) to show which text belongs to which field.

In any case, here is a patch to improve the TH detection in cases like
the above. Daniel, would you apply this to CVS?

Thanks,
Jari

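For anyone who wants to repeat the survey on lines already extracted from the pages, the same filter can be sketched in Perl. This is only an illustration: the helper names th_backslash_unquoted and th_has_escaped_space are made up here and are not part of roffit, and the sample lines are copied from the listing in this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of roffit): true for a .TH line that
# contains a backslash but no double quote, which is what the
# zgrep '^\.TH.*\\' | egrep -v '\"' pipeline keeps.
sub th_backslash_unquoted {
    my ($line) = @_;
    return ( $line =~ /^\.TH.*\\/ && $line !~ /"/ ) ? 1 : 0;
}

# Hypothetical helper: true when the line contains an escaped space "\ ".
sub th_has_escaped_space {
    my ($line) = @_;
    return ( $line =~ /\\ / ) ? 1 : 0;
}

# Sample .TH lines taken from this message.
my @samples = (
    '.TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"',
    '.TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual',
    '.TH CI 1 \*(Dt GNU',
    '.TH html2text 1 2008\-09\-20',
);

for my $line (@samples) {
    printf "%d %d %s\n",
        th_backslash_unquoted($line),
        th_has_escaped_space($line),
        $line;
}
```

The first column flags headers that would need converting to the double-quote notation; the second shows whether "\ " appears at all.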
From 35ba3f28fecb3ae38e1187e927cd16480fc91a77 Mon Sep 17 00:00:00 2001
From: Jari Aalto <jari.aa...@cante.net>
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

Signed-off-by: Jari Aalto <jari.aa...@cante.net>
---
 roffit |   47 ++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..49d01d7 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,56 @@ sub parsefile {
             $out = "";
 
             # cut off initial spaces
-            $rest =~ s/^ +//g;
+            $rest =~ s/^\s+//;
-            if($keyword eq "\\\"") {
+            if ( $keyword eq q(\\") ) {
                 # this is a comment, skip this line
             }
-            elsif($keyword =~ /^TH$/) {
+            elsif ( $keyword eq "TH" ) {
                 # man page header:
                 # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+                # Treat pages that have "*(Dt":
+                # .TH IDENT 1 \*(Dt GNU
+
+                $rest =~ s,\Q\\*(Dt,,g;
+
+                # Delete backslashes
+
+                $rest =~ s,\\,,g;
+
+                # Delete old RCS tags
+                # .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+                $rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
                 # NAME SECTION DATE VERSION MANUAL
-                if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+                # section can be: 1 or 3C
+
+                if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
                     # strict matching only so far
-                    $manpage{'name'} = $1;
+                    $manpage{'name'}    = $1;
                     $manpage{'section'} = $2;
-                    $manpage{'date'} = $3;
+                    $manpage{'date'}    = $3;
                     $manpage{'version'} = $4;
-                    $manpage{'manual'} = $6;
+                    $manpage{'manual'}  = $6;
                 }
+                # .TH html2text 1 2008-09-20 HH:MM:SS
+                elsif ( $rest =~ m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+                {
+                    $manpage{'name'}    = $1;
+                    $manpage{'section'} = $2;
+                    $manpage{'date'}    = $3;
+                    $manpage{'manual'}  = $4;
+                }
+                # Anything else, like:
+                # .TH IDENT 1 GNU
+                elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+                {
+                    $manpage{'name'}    = $1;
+                    $manpage{'section'} = $2;
+                    $manpage{'manual'}  = $3;
+                }
             }
             elsif($keyword =~ /^S[HS]$/) {
                 # SS is treated the same as SH
-- 
1.7.1
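
To try the idea outside roffit, the three-step fallback (fully quoted header, then unquoted date, then "anything else") can be sketched as a standalone function. This is an illustration under assumptions, not the patch itself: parse_th is a made-up name, the regexes are simplified variants of the ones in the patch, the section is matched as a digit followed by any non-space, non-quote characters, and whitespace is allowed before the optional quoted MANUAL field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of roffit): parse the text that follows
# ".TH" and return a hash with whichever fields could be identified.
sub parse_th {
    my ($rest) = @_;
    my %page;

    $rest =~ s/^\s+//;                       # cut off initial spaces
    $rest =~ s/\\\*\(Dt//g;                  # drop the \*(Dt string reference
    $rest =~ s/\\//g;                        # delete remaining backslashes
    $rest =~ s/\$Date:\s+(.*?)\s+\$/$1/g;    # unwrap an RCS "$Date: ... $" tag

    # 1. Fully quoted: NAME SECTION "DATE" "VERSION" ["MANUAL"]
    if ( $rest =~ /^(\S+)\s+"?(\d[^\s"]*)"?\s+"([^"]*)"\s+"([^"]*)"(?:\s+"([^"]*)")?/ ) {
        @page{qw(name section date version)} = ( $1, $2, $3, $4 );
        $page{manual} = $5 if defined $5;
    }
    # 2. Unquoted numeric date: NAME SECTION DATE [MANUAL]
    elsif ( $rest =~ m{^(\S+)\s+"?(\d[^\s"]*)"?\s+"?(\d[\d:/ -]*\d)"?\s*(.*)$} ) {
        @page{qw(name section date)} = ( $1, $2, $3 );
        $page{manual} = $4 if length $4;
    }
    # 3. Anything else: NAME SECTION MANUAL
    elsif ( $rest =~ /^(\S+)\s+"?(\d[^\s"]*)"?\s+(.+)$/ ) {
        @page{qw(name section manual)} = ( $1, $2, $3 );
    }
    return %page;
}

# Try it on header arguments quoted in this message.
for my $args (
    'curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"',
    'saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream',
    'IDENT 1 \*(Dt GNU',
  )
{
    my %p = parse_th($args);
    print join( ", ", map { "$_=$p{$_}" } sort keys %p ), "\n";
}
```

A header with nothing after the section, such as the join-dctrl one in the listing, matches none of the three branches in this sketch and yields an empty hash.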