Bug#584437: roffit: better man page header detection

2010-06-04 Thread Jari Aalto
tags 584437 + patch pending
retitle 584437 roffit: [PATCH] better man page header detection (.TH tag)
forwarded 584437 dan...@haxx.se
thanks

markus schnalke mei...@marmaro.de writes:

> .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
> .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

From an unscientific test:

$ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '"'

/usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
/usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
/usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
/usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
/usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
/usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
/usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
/usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
/usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
/usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
/usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
/usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
/usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
/usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
/usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
/usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
/usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
/usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
/usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
/usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
/usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream

There don't seem to be any cases where "\ " (an escaped space) is used.

I'm inclined to conclude that bug reports should be filed against packages
whose pages use backslashes in the .TH line. These pages should be
converted to use the double quote notation. The main problem with
those pages is:

- No information can be parsed reliably; there are no delimiters
  (start, stop) to mark where one field ends and the next begins.
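
To illustrate the delimiter problem, here is a minimal Python sketch (not
roffit code, which is Perl). The helper name th_fields is hypothetical, and
the splitting rule is deliberately simple: it honours double quotes only.
Without quotes, the date, version and manual fields fall apart into
separate words and there is no way to tell where one field ends.

```python
import re

unquoted = '.TH curl 1 22 Oct 2003 Curl 7.10.8 Curl Manual'
quoted = '.TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"'


def th_fields(line):
    """Split a .TH line into arguments, honouring double quotes only."""
    args = line.split(None, 1)[1]  # drop the ".TH" request itself
    # Each match is either a "quoted string" or a run of non-blank characters.
    return [q or w for q, w in re.findall(r'"([^"]*)"|(\S+)', args)]


print(len(th_fields(unquoted)))  # 9 words, field boundaries are lost
print(th_fields(quoted))         # 5 fields, as the author intended
```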

In any case, here is a patch to improve the TH detection in cases like the
above.

Daniel, would you apply this to CVS?

Thanks,
Jari

From 35ba3f28fecb3ae38e1187e927cd16480fc91a77 Mon Sep 17 00:00:00 2001
From: Jari Aalto jari.aa...@cante.net
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit


Signed-off-by: Jari Aalto jari.aa...@cante.net
---
 roffit |   47 ++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..49d01d7 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,56 @@ sub parsefile {
 $out = "";
 
 # cut off initial spaces
-$rest =~ s/^ +//g;
+$rest =~ s/^\s+//;
 
-if($keyword eq "\\\"") {
+if ( $keyword eq q(\\") ) {
 # this is a comment, skip this line
 }
-elsif($keyword =~ /^TH$/) {
+elsif ( $keyword eq "TH" ) {
 # man page header:
 # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+		# Treat pages that have *(Dt:
+		# .TH IDENT 1 \*(Dt GNU
+
+		$rest =~ s,\Q\\*(Dt,,g;
+
+		# Delete backslashes
+
+		$rest =~ s,\\,,g;
+
+		# Delete old RCS tags
+		# .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+		$rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
 # NAME SECTION DATE VERSION MANUAL
-if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+		# section can be: 1 or 3C
+
+if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
 # strict matching only so far
-$manpage{'name'} = $1;
+$manpage{'name'}= $1;
 $manpage{'section'} = $2;
-$manpage{'date'} = $3;
+$manpage{'date'}= $3;
 $manpage{'version'} = $4;
-$manpage{'manual'} = $6;
+$manpage{'manual'}  = $6;
 }
+	# .TH html2text 1 2008-09-20 HH:MM:SS
+		elsif ( $rest =~  m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+		{
+$manpage{'name'}= $1;
+$manpage{'section'} = $2;
+$manpage{'date'}= $3;
+$manpage{'manual'}  = $4;
+		}
+	# Anything else, like:
+		# .TH IDENT 1 GNU
+		elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+		{
+$manpage{'name'}= $1;
+$manpage{'section'} = $2;
+$manpage{'manual'}  = $3;
+		}
 }
 elsif($keyword =~ /^S[HS]$/) {
 # SS is treated the same as SH
-- 
1.7.1



Bug#584437: roffit: better man page header detection

2010-06-04 Thread markus schnalke
[2010-06-04 11:02] Jari Aalto jari.aa...@cante.net
> markus schnalke mei...@marmaro.de writes:
>
> > .TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
> > .TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual
>
> From an unscientific test:
>
> $ ls /usr/share/man/man1/*.gz | xargs zgrep '^\.TH.*\\' | egrep -v '"'
>
> /usr/share/man/man1/ci.1.gz:.TH CI 1 \*(Dt GNU
> /usr/share/man/man1/co.1.gz:.TH CO 1 \*(Dt GNU
> /usr/share/man/man1/evince-thumbnailer.1.gz:.TH evince\-thumbnailer 1 2007\-01\-15
> /usr/share/man/man1/formail.1.gz:.TH FORMAIL 1 \*(Dt BuGless
> /usr/share/man/man1/gnome-panel.1.gz:.TH gnome-panel 1 2006\-03\-07
> /usr/share/man/man1/html2text.1.gz:.TH html2text 1 2008\-09\-20
> /usr/share/man/man1/ident.1.gz:.TH IDENT 1 \*(Dt GNU
> /usr/share/man/man1/join-dctrl.1.gz:.TH join\-dctrl 1
> /usr/share/man/man1/lockfile.1.gz:.TH LOCKFILE 1 \*(Dt BuGless
> /usr/share/man/man1/merge.1.gz:.TH MERGE 1 \*(Dt GNU
> /usr/share/man/man1/patch.1.gz:.TH PATCH 1 \*(Dt GNU
> /usr/share/man/man1/procmail.1.gz:.TH PROCMAIL 1 \*(Dt BuGless
> /usr/share/man/man1/rcs.1.gz:.TH RCS 1 \*(Dt GNU
> /usr/share/man/man1/rcsclean.1.gz:.TH RCSCLEAN 1 \*(Dt GNU
> /usr/share/man/man1/rcsdiff.1.gz:.TH RCSDIFF 1 \*(Dt GNU
> /usr/share/man/man1/rcsfreeze.1.gz:.TH RCSFREEZE 1 \*(Dt GNU
> /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
> /usr/share/man/man1/rcsmerge.1.gz:.TH RCSMERGE 1 \*(Dt GNU
> /usr/share/man/man1/rlog.1.gz:.TH RLOG 1 \*(Dt GNU
> /usr/share/man/man1/rpcgen.1.gz:.TH \*(x}
> /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
>
> There don't seem to be any cases where "\ " (an escaped space) is used.

The last line is such a case.

> I'm inclined to conclude that bug reports should be filed against packages
> whose pages use backslashes in the .TH line. These pages should be
> converted to use the double quote notation.

I could agree for TH lines with escaped spaces, but not for TH lines
that use any backslash at all.

In particular, \- must remain possible, because it means something
different from a plain -.

> The main problem with those pages is:
>
> - No information can be parsed reliably; there are no delimiters
>   (start, stop) to mark where one field ends and the next begins.

If you parse it char by char, then you can parse it reliably. Nroff
can do it. But I don't think we want this overhead here.
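
A character-by-character split could look roughly like the following
Python sketch. It is not roffit code (roffit is Perl) and not a full roff
parser; the function name split_th_args and the simplified rules in its
docstring are assumptions made for illustration. Under those rules the
quoted and the escaped-space forms of the curl header give the same
fields, and \- is passed through unchanged.

```python
def split_th_args(rest):
    r"""Split the text after ".TH" into arguments, one character at a time.

    Simplified rules (an assumption, not what nroff does in full):
      - blanks separate arguments,
      - a double-quoted run is a single argument,
      - backslash + space is a literal space inside an argument,
      - any other backslash sequence (such as \-) is kept untouched.
    """
    args, cur = [], []
    in_quotes = False
    started = False  # True once the current argument has begun
    i = 0
    while i < len(rest):
        ch = rest[i]
        if ch == '\\' and i + 1 < len(rest):
            nxt = rest[i + 1]
            cur.append(' ' if nxt == ' ' else ch + nxt)
            started = True
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
            started = True  # so that "" still yields an (empty) argument
        elif ch in ' \t' and not in_quotes:
            if started:
                args.append(''.join(cur))
                cur, started = [], False
        else:
            cur.append(ch)
            started = True
        i += 1
    if started:
        args.append(''.join(cur))
    return args


a = split_th_args('curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"')
b = split_th_args(r'curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual')
print(a == b)  # True: both forms yield the same five fields
```

The cost is the explicit state machine, which is the overhead mentioned
above compared with a single regexp.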


The most important thing is detecting the first two parameters (name
and section). These will almost always be detectable without problems.
If we can detect them, we should display them in the page title. The
``secret man page'' should then appear almost never.

For all the other parameters we should try to detect them as well as
possible. If we can detect values, we should use them; otherwise
we should just ignore them. IMO we can ignore escaped spaces here.


> In any case, here is a patch to improve the TH detection in cases like the
> above.
>
> Daniel, would you apply this to CVS?

I think we can still improve that one. Let's do it in two steps:

First detect the first two arguments, which will succeed almost
always. And as a separate step we could try to detect the rest.

In general: Did you notice that nothing but the first argument of TH
is ever used by roffit? Thus we should think about how much code we
put into roffit to detect the other arguments.

It might be enough to detect the first two arguments which will be
successful in most cases, and we don't have to mess around with the
rest.
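
The two-step idea could be sketched like this in Python (again only an
illustration, not roffit's Perl; parse_th and the dictionary keys are
hypothetical names modelled on the %manpage hash in the patch). Step one
takes name and section; step two only trusts the remaining fields when
they are double-quoted, and otherwise leaves them unset.

```python
import re


def parse_th(rest):
    """Step 1: name and section. Step 2: best-effort guess at the rest."""
    page = {}
    # Step 1: first word is the name, second is the section (1, 3C, ...),
    # optionally wrapped in double quotes.
    m = re.match(r'\s*(\S+)\s+"?(\d[^\s"]*)"?(?:\s+(.*))?$', rest)
    if not m:
        return page  # nothing usable; the caller keeps its defaults
    page['name'], page['section'] = m.group(1), m.group(2)

    # Step 2: only use the other fields when they are clearly delimited.
    quoted = re.findall(r'"([^"]*)"', m.group(3) or '')
    for key, value in zip(('date', 'version', 'manual'), quoted):
        page[key] = value
    return page


print(parse_th('curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"'))
print(parse_th(r'IDENT 1 \*(Dt GNU'))  # name and section only
```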


Unsolved still is \*(Dt. Your patch deletes it. This might be the best
solution for now.
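
For reference, the three clean-up substitutions in the patch (drop \*(Dt,
drop backslashes, unwrap an RCS $Date$ keyword) behave roughly like this
Python sketch. It is my reading of the patch ported to Python, not the
patch itself, and normalize_th is a hypothetical name. Note that the
second step also turns \- into a plain -.

```python
import re


def normalize_th(rest):
    """Apply the patch's three clean-up steps, in the same order."""
    rest = rest.replace(r'\*(Dt', '')  # drop the unresolved date string
    rest = rest.replace('\\', '')      # drop all remaining backslashes
    # Unwrap an RCS keyword: "$Date: 2006/11/30 23:42:42 $" keeps the date.
    rest = re.sub(r'\$Date:\s+(.*?)\s+\$', r'\1', rest)
    return rest


print(normalize_th(r'saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream'))
# saidar 1 2006/11/30 23:42:42 i-scream
```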


meillo






Bug#584437: roffit: better man page header detection

2010-06-04 Thread Jari Aalto
markus schnalke mei...@marmaro.de writes:
> /usr/share/man/man1/rcsintro.1.gz:.TH RCSINTRO 1 \*(Dt GNU
> /usr/share/man/man1/saidar.1.gz:.TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
>
> The last line is such a case.

Handled in the patch.

> If you parse it char by char, then you can parse it

I meant that you can't read information from space-delimited text where
the fields mean different things. It needs a quote to mark BEGIN
and a quote to mark END for:

NAME SECTION DATE VERSION MANUAL

> The most important thing is detecting the first two parameters

> ... First detect the first two arguments, which will succeed almost
> always.

Added a final ELSIF case. Daniel, use this.

Jari

From 5675160c2b879b9d4b9b29e16224a8090ce32b0a Mon Sep 17 00:00:00 2001
From: Jari Aalto jari.aa...@cante.net
Date: Fri, 4 Jun 2010 10:12:23 +0300
Subject: [PATCH] roffit: improve TH handling
Organization: Private
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit


Signed-off-by: Jari Aalto jari.aa...@cante.net
---
 roffit |   52 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/roffit b/roffit
index 3149f37..ae55406 100755
--- a/roffit
+++ b/roffit
@@ -203,23 +203,61 @@ sub parsefile {
 $out = "";
 
 # cut off initial spaces
-$rest =~ s/^ +//g;
+$rest =~ s/^\s+//;
 
-if($keyword eq "\\\"") {
+if ( $keyword eq q(\\") ) {
 # this is a comment, skip this line
 }
-elsif($keyword =~ /^TH$/) {
+elsif ( $keyword eq "TH" ) {
 # man page header:
 # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
+
+		# Treat pages that have *(Dt:
+		# .TH IDENT 1 \*(Dt GNU
+
+		$rest =~ s,\Q\\*(Dt,,g;
+
+		# Delete backslashes
+
+		$rest =~ s,\\,,g;
+
+		# Delete old RCS tags
+		# .TH saidar 1 $Date:\ 2006/11/30\ 23:42:42\ $ i\-scream
+
+		$rest =~ s,\$Date:\s+(.*?)\s+\$,$1,g;
+
 # NAME SECTION DATE VERSION MANUAL
-if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+		# section can be: 1 or 3C
+
+if ( $rest =~ /(\S+)\s+\"?(\d\S?+)\"?\s+\"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/ ) {
 # strict matching only so far
-$manpage{'name'} = $1;
+$manpage{'name'}= $1;
 $manpage{'section'} = $2;
-$manpage{'date'} = $3;
+$manpage{'date'}= $3;
 $manpage{'version'} = $4;
-$manpage{'manual'} = $6;
+$manpage{'manual'}  = $6;
 }
+	# .TH html2text 1 2008-09-20 HH:MM:SS
+		elsif ( $rest =~  m, (\S+) \s+ \"?(\d\S?+)\"? \s+ \"?([ \d:/-]+)\"? \s* (.*) ,x )
+		{
+$manpage{'name'}= $1;
+$manpage{'section'} = $2;
+$manpage{'date'}= $3;
+$manpage{'manual'}  = $4;
+		}
+		# .TH program 1 description
+		elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? \s+ (.+)/x )
+		{
+$manpage{'name'}= $1;
+$manpage{'section'} = $2;
+$manpage{'manual'}  = $3;
+		}
+		# .TH program 1
+		elsif ( $rest =~ /(\S+) \s+ \"?(\d\S?+)\"? /x )
+		{
+$manpage{'name'}= $1;
+$manpage{'section'} = $2;
+		}
 }
 elsif($keyword =~ /^S[HS]$/) {
 # SS is treated the same as SH
-- 
1.7.1



Bug#584437: roffit: better man page header detection

2010-06-03 Thread markus schnalke
Package: roffit
Version: 0.6+cvs20090507-1
Severity: wishlist
Tags: upstream patch

Roffit is very strict in detecting man page headers (TH lines).

It should recognize fields correctly if unnecessary double quotes are
omitted. Also, arbitrary white space between fields should not matter.
The attached patch probably fixes these problems.
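
As a rough illustration of what the relaxed pattern accepts, here is the
regexp from the attached patch ported to Python's verbose syntax. This is
a sketch under the assumption that the stripped characters in the patch
are double quotes; the name TH_RE is hypothetical. Quoted fields keep
their surrounding quotes in the captured groups, as they would with the
Perl original.

```python
import re

TH_RE = re.compile(r'''
    ([^ ]+)[ \t]+                # name
    (\d+)[ \t]+                  # section
    ("[^"]+"|[^ \t]+)[ \t]+      # date: quoted string or a single word
    ("[^"]+"|[^ \t]+)[ \t]+      # version: quoted string or a single word
    ("[^"]+"|[^ \t]+)?           # manual (optional)
''', re.VERBOSE)

# Extra blanks, a tab, and an unquoted single-word version are all accepted.
line = 'curl  1\t"22 Oct 2003" Curl-7.10.8 "Curl Manual"'
m = TH_RE.search(line)
print(m.groups())
```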


One issue is still unsolved: Escaped spaces. For nroff the following
lines are equivalent:

.TH curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"

.TH curl 1 22\ Oct\ 2003 Curl\ 7.10.8 Curl\ Manual

This corner-case will probably seldom show up, however. Fixing it
might require more than the single regexp that is used currently.


-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.30-2-686-bigmem (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages roffit depends on:
ii  perl  5.10.1-12  Larry Wall's Practical Extraction 

roffit recommends no packages.

roffit suggests no packages.

-- no debconf information
--- roffit.orig 2010-06-03 16:30:34.0 +0200
+++ roffit  2010-06-03 16:56:57.0 +0200
@@ -212,13 +212,19 @@
 # man page header:
 # curl 1 "22 Oct 2003" "Curl 7.10.8" "Curl Manual"
 # NAME SECTION DATE VERSION MANUAL
-if($rest =~ /([^ ]*) (\d+) \"([^\"]*)\" \"([^\"]*)\"(\"([^\"]*)\")?/) {
+if($rest =~ /
+([^ ]+)[ \t]+
+(\d+)[ \t]+
+(\"[^\"]+\"|[^ \t]+)[ \t]+
+(\"[^\"]+\"|[^ \t]+)[ \t]+
+(\"[^\"]+\"|[^ \t]+)?
+/x) {
 # strict matching only so far
 $manpage{'name'} = $1;
 $manpage{'section'} = $2;
 $manpage{'date'} = $3;
 $manpage{'version'} = $4;
-$manpage{'manual'} = $6;
+$manpage{'manual'} = $5;
 }
 }
 elsif($keyword =~ /^S[HS]$/) {