Bug#584437: roffit: better man page header detection
tags 584437 + patch pending
retitle 584437 roffit: [PATCH] better man page header detection (.TH tag)
forwarded 584437 dan...@haxx.se
thanks

markus schnalke mei...@marmaro.de writes:

> .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
> .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

From an unscientific test:

$ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '\"'
/usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
/usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
/usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
/usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
/usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
/usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
/usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
/usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
/usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
/usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
/usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
/usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
/usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
/usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
/usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
/usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
/usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
/usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
/usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
/usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
/usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream

There don't seem to be any cases where "\ " is used.

I'm inclined to conclude that bug reports should be sent to packages
that have pages using backslashes in the .TH line. These pages should be
converted to use the double quote notation.

The main problem is with those pages:

- No information can be parsed reliably; there are no delimiters
  (start, stop) to specify which text is within which.

In any case, here is a patch to improve the TH detection in cases like
the above. Daniel, would you apply this to CVS?

Thanks,
Jari

From 35ba3f28fecb3ae38e1187e927cd16480fc91a77 Mon Sep 17 00:00:00 2001
From: Jari Aalto jari.aa...@cante.net
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Jari Aalto jari.aa...@cante.net
---
 roffit | 47 ---
 1 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..49d01d7 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,56 @@ sub parsefile {
         $out = "";

         # cut off initial spaces
-        $rest =~ s/^ +//g;
+        $rest =~ s/^\s+//;

-        if($keyword eq "\\\"") {
+        if ( $keyword eq q(\\") ) {
             # this is a comment, skip this line
         }
-        elsif($keyword =~ /^TH$/) {
+        elsif ( $keyword eq "TH" ) {
             # man page header:
             # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+            # Treat pages that have *(Dt:
+            # .TH IDENT 1 \*(Dt GNU
+
+            $rest =~ s,\Q\\*(Dt,,g;
+
+            # Delete backslashes
+
+            $rest =~ s,\\,,g;
+
+            # Delete old RCS tags
+            # .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+            $rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
             # NAME SECTION DATE VERSION MANUAL
-            if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+            # section can be: 1 or 3C
+
+            if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
                 # strict matching only so far
-                $manpage{'name'} = $1;
+                $manpage{'name'}= $1;
                 $manpage{'section'} = $2;
-                $manpage{'date'} = $3;
+                $manpage{'date'}= $3;
                 $manpage{'version'} = $4;
-                $manpage{'manual'} = $6;
+                $manpage{'manual'} = $6;
             }
+            # .TH html2text 1 2008-09-20 HH:MM:SS
+            elsif ( $rest =~ m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+            {
+                $manpage{'name'}= $1;
+                $manpage{'section'} = $2;
+                $manpage{'date'}= $3;
+                $manpage{'manual'} = $4;
+            }
+            # Anything else, like:
+            # .TH IDENT 1 GNU
+            elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+            {
+                $manpage{'name'}= $1;
+                $manpage{'section'} = $2;
+                $manpage{'manual'} = $3;
+            }
         }
         elsif($keyword =~ /^S[HS]$/) {
             # SS is treated the same as SH
-- 
1.7.1
Bug#584437: roffit: better man page header detection
[2010-06-04 11:02] Jari Aalto jari.aa...@cante.net
> markus schnalke mei...@marmaro.de writes:
>
> > .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
> > .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual
>
> From an unscientific test:
>
> $ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '\"'
> /usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
> /usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
> /usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
> /usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
> /usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
> /usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
> /usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
> /usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
> /usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
> /usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
> /usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
> /usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
> /usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
> /usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
> /usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
> /usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
> /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
> /usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
> /usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
> /usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
> /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
>
> There don't seem to be any cases where "\ " is used.

The last line is such a case.

> I'm inclined to conclude that bug reports should be sent to packages
> that have pages using backslashes in the .TH line. These pages should be
> converted to use the double quote notation.

I could agree for TH lines with escaped spaces, but not for using any
backslashes in TH lines. Especially \- must be possible as it means
something different to -.

> The main problem is with those pages:
>
> - No information can be parsed reliably; there are no delimiters
>   (start, stop) to specify which text is within which.

If you parse it char for char, then you can parse it reliably. Nroff
can do it. But I don't think we want this overhead here.

The most important thing is detecting the first two parameters (name
and section). These will almost always be detectable without problems.
If we can detect them, we should display them in the page title. The
``secret man page'' should then almost never appear.

For all the other parameters we should try to detect them as well as
possible. If we can detect values then we should use them, otherwise
we should just ignore them. IMO we can ignore escaped spaces here.

> In any case, here is a patch to improve the TH detection in cases like
> the above. Daniel, would you apply this to CVS?

I think we can still improve that one. Let's do it in two steps: First
detect the first two arguments, which will succeed almost always. And
as a separate step we could try to detect the rest.

In general: Did you notice that nothing but the first argument of TH
is ever used by roffit? Thus we should think about how much code we
put into roffit to detect the other arguments. It might be enough to
detect the first two arguments, which will be successful in most cases,
and then we don't have to mess around with the rest.

Unsolved still is \*(Dt. Your patch deletes it. This might be the best
solution for now.

meillo
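The two-step idea described in this message (detect name and section first, then try the remaining arguments separately) could look roughly like the Perl sketch below. It is not code from roffit or from either patch: parse_th and the hash keys are hypothetical names chosen for illustration, and the second step is assumed to accept only the fully quoted date/version/manual form.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of roffit): parse the argument part of a
# .TH line in two independent steps.
sub parse_th {
    my ($rest) = @_;
    my %page;

    $rest =~ s/^\s+//;

    # Step 1: name and section, each optionally wrapped in double quotes.
    # The section may carry a suffix, as in "3C".
    if ( $rest =~ /^"?([^\s"]+)"?\s+"?(\d\w*)"?\s*(.*)$/ ) {
        @page{qw(name section)} = ( $1, $2 );
        my $tail = $3;

        # Step 2: best effort for the remaining fields. If this fails,
        # the name and section found in step 1 are still kept.
        if ( $tail =~ /^"([^"]*)"\s+"([^"]*)"(?:\s+"([^"]*)")?/ ) {
            @page{qw(date version)} = ( $1, $2 );
            $page{manual} = $3 if defined $3;
        }
    }
    return \%page;
}
```

With this split, a header such as `.TH IDENT 1 \*(Dt GNU` would still yield a name and a section even though the remaining fields are not recognised, while a line like `.TH \*(x}` yields nothing at all.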
Bug#584437: roffit: better man page header detection
markus schnalke mei...@marmaro.de writes:

> > /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
> > /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
>
> The last line is such a case.

Handled in the patch.

> If you parse it char for char, then you can parse it

I meant that you can't read information from space-delimited text
where the fields mean different things. It needs a quote to say BEGIN
and a quote to say END for:

    NAME SECTION DATE VERSION MANUAL

> The most important thing is detecting the first two parameters ...
> First detect the first two arguments, which will succeed almost always.

Added a final ELSIF case. Daniel, use this.

Jari

From 5675160c2b879b9d4b9b29e16224a8090ce32b0a Mon Sep 17 00:00:00 2001
From: Jari Aalto jari.aa...@cante.net
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Jari Aalto jari.aa...@cante.net
---
 roffit | 52 +---
 1 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..ae55406 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,61 @@ sub parsefile {
         $out = "";

         # cut off initial spaces
-        $rest =~ s/^ +//g;
+        $rest =~ s/^\s+//;

-        if($keyword eq "\\\"") {
+        if ( $keyword eq q(\\") ) {
             # this is a comment, skip this line
         }
-        elsif($keyword =~ /^TH$/) {
+        elsif ( $keyword eq "TH" ) {
             # man page header:
             # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+            # Treat pages that have *(Dt:
+            # .TH IDENT 1 \*(Dt GNU
+
+            $rest =~ s,\Q\\*(Dt,,g;
+
+            # Delete backslashes
+
+            $rest =~ s,\\,,g;
+
+            # Delete old RCS tags
+            # .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+            $rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
             # NAME SECTION DATE VERSION MANUAL
-            if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+            # section can be: 1 or 3C
+
+            if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
                 # strict matching only so far
-                $manpage{'name'} = $1;
+                $manpage{'name'}= $1;
                 $manpage{'section'} = $2;
-                $manpage{'date'} = $3;
+                $manpage{'date'}= $3;
                 $manpage{'version'} = $4;
-                $manpage{'manual'} = $6;
+                $manpage{'manual'} = $6;
             }
+            # .TH html2text 1 2008-09-20 HH:MM:SS
+            elsif ( $rest =~ m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+            {
+                $manpage{'name'}= $1;
+                $manpage{'section'} = $2;
+                $manpage{'date'}= $3;
+                $manpage{'manual'} = $4;
+            }
+            # .TH program 1 description
+            elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+            {
+                $manpage{'name'}= $1;
+                $manpage{'section'} = $2;
+                $manpage{'manual'} = $3;
+            }
+            # .TH program 1
+            elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? /x )
+            {
+                $manpage{'name'}= $1;
+                $manpage{'section'} = $2;
+            }
         }
         elsif($keyword =~ /^S[HS]$/) {
             # SS is treated the same as SH
-- 
1.7.1
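The clean-up steps that the patch runs before matching (drop \*(Dt, drop backslashes, unwrap an RCS $Date$ keyword) can be tried in isolation with a sketch like the one below. The function name normalize_th is hypothetical, the first substitution is written with explicit escapes rather than the patch's \Q form, and the final whitespace collapse is an addition for readability that the patch itself does not perform.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the order of the patch's clean-up steps.
sub normalize_th {
    my ($rest) = @_;

    $rest =~ s/^\s+//;                           # cut off initial spaces
    $rest =~ s/\\\*\(Dt//g;                      # drop the \*(Dt string reference
    $rest =~ s/\\//g;                            # drop remaining backslashes
    $rest =~ s{\$Date:\s+(.*?)\s+\$}{$1}g;       # unwrap "$Date: ... $"
    $rest =~ s/\s+/ /g;                          # assumption: collapse leftover gaps

    return $rest;
}

print normalize_th('saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream'), "\n";
print normalize_th('IDENT 1 \*(Dt GNU'), "\n";
```

The order matters: the $Date$ pattern expects plain whitespace, so it only matches once the backslashes in front of the spaces are gone. The sketch also shows the cost raised earlier in the thread, namely that \- and - become indistinguishable after the blanket backslash removal.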
Bug#584437: roffit: better man page header detection
Package: roffit
Version: 0.6+cvs20090507-1
Severity: wishlist
Tags: upstream patch

Roffit is very strict in detecting man page headers (TH lines). It
should recognize fields correctly if unnecessary double quotes are
omitted. Also, arbitrary white space between fields should not matter.

The attached patch probably fixes these problems.

One issue is still unsolved: Escaped spaces. For nroff the following
lines are equivalent:

    .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
    .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

This corner case will probably seldom show up, however. Fixing it
might require more than the single regexp that is used currently.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.30-2-686-bigmem (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages roffit depends on:
ii  perl  5.10.1-12  Larry Wall's Practical Extraction

roffit recommends no packages.

roffit suggests no packages.

-- no debconf information

--- roffit.orig 2010-06-03 16:30:34.0 +0200
+++ roffit 2010-06-03 16:56:57.0 +0200
@@ -212,13 +212,19 @@
             # man page header:
             # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
             # NAME SECTION DATE VERSION MANUAL
-            if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+            if($rest =~ /
+                ([^ ]+)[ \t]+
+                (\d+)[ \t]+
+                (\"[^\"]+\"|[^ \t]+)[ \t]+
+                (\"[^\"]+\"|[^ \t]+)[ \t]+
+                (\"[^\"]+\"|[^ \t]+)?
+                /x) {
                 # strict matching only so far
                 $manpage{'name'} = $1;
                 $manpage{'section'} = $2;
                 $manpage{'date'} = $3;
                 $manpage{'version'} = $4;
-                $manpage{'manual'} = $6;
+                $manpage{'manual'} = $5;
             }
         }
         elsif($keyword =~ /^S[HS]$/) {
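As a rough illustration of what handling escaped spaces "with more than a single regexp" might involve, the Perl sketch below splits the TH arguments one token at a time, so that a double-quoted field and a field with backslash-escaped spaces come out the same. It is only a sketch under assumptions: split_th_args is a hypothetical name, it is not part of roffit or of the attached patch, and it deliberately leaves all escapes other than "\ " (for example \-) untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: split the arguments of a .TH line into fields.
# A field is either "double quoted" or a run of non-blank characters in
# which a backslash protects the character that follows it.
sub split_th_args {
    my ($rest) = @_;
    my @args;

    while ( $rest =~ /\G\s*(?:"([^"]*)"|((?:\\.|[^\s\\"])+))/gc ) {
        my $arg = defined $1 ? $1 : $2;
        $arg =~ s/\\ / /g;    # an escaped space becomes a plain space
        push @args, $arg;
    }
    return @args;
}

my @quoted  = split_th_args('curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"');
my @escaped = split_th_args('curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual');

print join( '|', @quoted ),  "\n";
print join( '|', @escaped ), "\n";
```

Both calls are meant to produce the same five fields, which could then be assigned to name, section, date, version and manual by position instead of by one large pattern.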