tags 584437 + patch pending
retitle 584437 roffit: [PATCH] better man page header detection (.TH tag)
forwarded 584437 <dan...@haxx.se>
thanks

markus schnalke <mei...@marmaro.de> writes:

>     .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
>     .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

From an unscientific test:

    $ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '\"'

    /usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
    /usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
    /usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
    /usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
    /usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
    /usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
    /usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
    /usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
    /usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
    /usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
    /usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
    /usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
    /usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
    /usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
    /usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
    /usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
    /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
    /usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
    /usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
    /usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
    /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream

Apart from the RCS $Date$ tag in saidar.1, there don't seem to be any
cases where "\ " is used.

I'm inclined to conclude that bug reports should be filed against the
packages whose pages use backslashes in the .TH line. Those pages should
be converted to use the double-quote notation. The main problem with
those pages:

    - No information can be parsed reliably; there are no delimiters
      (start, stop) to mark where each field begins and ends.
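
To illustrate the delimiter problem, here is a small Python sketch using
the two curl lines quoted at the top. It is only an approximation:
shlex.split() is a shell-style tokenizer standing in for roff argument
parsing (my assumption, not what roffit or groff actually do). With the
quotes, or with the "\ " escapes honoured, the five fields come out the
same; once the backslashes are simply deleted, whitespace is the only
separator left and the field boundaries are gone.

```python
import shlex

# The two notations from the quoted message (without the leading ".TH ")
quoted = 'curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"'
escaped = r'curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual'

# Assumption: shell-style tokenizing is close enough to roff argument
# splitting for these two lines (quotes group, "\ " keeps a space).
fields_quoted = shlex.split(quoted)
fields_escaped = shlex.split(escaped)

# Deleting the backslashes first leaves plain whitespace-separated words,
# so "22 Oct 2003" can no longer be told apart from three separate fields.
stripped = escaped.replace("\\", "").split()

print(fields_quoted)
print(fields_escaped)
print(stripped)
```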

In any case, here is a patch to improve the TH detection in cases like
the above.
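
The idea is a cascade: normalize the line, try the strict quoted form
first, then fall back to looser forms. A rough Python sketch of that
idea follows. It is not a translation of the Perl in the patch: the
function name parse_th, the named groups, the \d\w* section pattern and
the optional whitespace before the last quoted field are my own choices
for illustration.

```python
import re

# Section may be quoted and may carry a suffix: 1, 3C, ...
SECTION = r'"?(?P<section>\d\w*)"?'

PATTERNS = [
    # Strict: NAME SECTION "DATE" "VERSION" ["MANUAL"]
    re.compile(r'(?P<name>\S+)\s+' + SECTION +
               r'\s+"(?P<date>[^"]*)"\s+"(?P<version>[^"]*)"'
               r'(?:\s+"(?P<manual>[^"]*)")?'),
    # Unquoted date made of digits and : / - (e.g. 2008-09-20), then the rest
    re.compile(r'(?P<name>\S+)\s+' + SECTION +
               r'\s+"?(?P<date>[\d:/-][ \d:/-]*)"?\s*(?P<manual>.*)'),
    # Anything else: NAME SECTION rest
    re.compile(r'(?P<name>\S+)\s+' + SECTION + r'\s+(?P<manual>.+)'),
]


def parse_th(rest):
    """Parse the arguments of a .TH line (hypothetical helper)."""
    rest = rest.strip()
    rest = rest.replace(r'\*(Dt', '')      # drop the \*(Dt string reference
    rest = rest.replace('\\', '')          # delete remaining backslashes
    # Unwrap an RCS tag: "$Date: 2006/11/30 23:42:42 $" -> date only
    rest = re.sub(r'\$Date:\s+(.*?)\s+\$', r'\1', rest)
    for pattern in PATTERNS:
        m = pattern.match(rest)
        if m:
            # Keep only the fields that matched and are not empty
            return {k: v.strip() for k, v in m.groupdict().items()
                    if v and v.strip()}
    return {}


for line in ['curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"',
             r'html2text 1 2008\-09\-20',
             r'IDENT 1 \*(Dt GNU']:
    print(parse_th(line))
```

The order matters: the strict pattern has to be tried first, because the
last pattern accepts almost any line that has a name and a section.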

Daniel, would you apply this to CVS?

Thanks,
Jari

From 35ba3f28fecb3ae38e1187e927cd16480fc91a77 Mon Sep 17 00:00:00 2001
From: Jari Aalto <jari.aa...@cante.net>
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit


Signed-off-by: Jari Aalto <jari.aa...@cante.net>
---
 roffit |   47 ++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..49d01d7 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,56 @@ sub parsefile {
             $out = "";
             
             # cut off initial spaces
-            $rest =~ s/^ +//g;
+            $rest =~ s/^\s+//;
             
-            if($keyword eq "\\\"") {
+            if ( $keyword eq q(\\") ) {
                 # this is a comment, skip this line
             }
-            elsif($keyword =~ /^TH$/) {
+            elsif ( $keyword eq "TH" ) {
                 # man page header:
                 # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+		# Treat pages that have "*(Dt":
+		# .TH IDENT 1 \*(Dt GNU
+
+		$rest =~ s,\Q\\*(Dt,,g;
+
+		# Delete backslashes
+
+		$rest =~ s,\\,,g;
+
+		# Delete old RCS tags
+		# .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+		$rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
                 # NAME SECTION DATE VERSION MANUAL
-                if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+		# section can be: 1 or 3C
+
+                if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
                     # strict matching only so far
-                    $manpage{'name'} = $1;
+                    $manpage{'name'}    = $1;
                     $manpage{'section'} = $2;
-                    $manpage{'date'} = $3;
+                    $manpage{'date'}    = $3;
                     $manpage{'version'} = $4;
-                    $manpage{'manual'} = $6;
+                    $manpage{'manual'}  = $6;
                 }
+	        # .TH html2text 1 2008-09-20 HH:MM:SS
+		elsif ( $rest =~  m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+		{
+                    $manpage{'name'}    = $1;
+                    $manpage{'section'} = $2;
+                    $manpage{'date'}    = $3;
+                    $manpage{'manual'}  = $4;
+		}
+	        # Anything else, like:
+		# .TH IDENT 1 GNU
+		elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+		{
+                    $manpage{'name'}    = $1;
+                    $manpage{'section'} = $2;
+                    $manpage{'manual'}  = $3;
+		}
             }
             elsif($keyword =~ /^S[HS]$/) {
                 # SS is treated the same as SH
-- 
1.7.1
