Package: catdoc
Version: 1:0.95-4.1
Severity: minor
Tags: patch

  The patch is in the attachment

  Summary:

Remove space at end of lines.

Fix warnings from test-groff.

Change - to \- if it shall be printed as a minus.

Change \\ to \e to print the escape character.

Change a HYPHEN-MINUS (code 055 octal, 0x2D hex) to a minus (\-)
if it matches " -[:alpha:]" or \(aq-[:alpha:] (for options).

Add a small space around "|" to increase readability.

Add a comma (or \&) after "e.g." and "i.e.", or use English words.

Increase space between sentences to two space characters.

Split long lines (> 80).

Use \(en for a dash where appropriate.

Split a punctuation mark from a single argument of a two-font macro.

###

  Details:

Input file is catdoc.1

mandoc: catdoc.1:6:38: STYLE: unterminated quoted argument

mandoc: catdoc.1:6:40: STYLE: whitespace at end of input line
  Many such lines
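
For illustration only, a minimal Python sketch of how such lines could be
located and cleaned.  The function names are made up for this example, and
this is not a description of how mandoc implements the check.

```python
import re

# One or more spaces or tabs at the very end of a line.
TRAILING_WS = re.compile(r"[ \t]+$")


def trailing_whitespace_lines(text):
    """Return the 1-based numbers of lines that end in a space or tab."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if TRAILING_WS.search(line)
    ]


def strip_trailing_whitespace(text):
    """Return text with spaces and tabs removed from the end of every line."""
    return "\n".join(TRAILING_WS.sub("", line) for line in text.split("\n"))
```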

#######

Test nr. 2:

Enable and fix warnings from 'test-groff'.

<catdoc.1>:125 (macro BR): only 1 argument, but more are expected
<catdoc.1>:241 (macro BR): only 1 argument, but more are expected

Output is from: test-groff -b -e -mandoc -T utf8 -rF0 -t -w w -z

  [ "test-groff" is a developmental version of "groff" ]

####

Test nr. 18:

Change - to \- if it shall be printed as a minus sign.

132:.B -8

#####

Test nr. 20:

Use "\e" to print the escape character instead of "\\" (which gets
interpreted in copy mode).

145:causes catdoc to output unknown UNICODE character as \\xNNNN, instead

#####

Test nr. 25:

Change a HYPHEN-MINUS (code 055 octal, 0x2D hex) to a minus (\-) if it is
in front of a name for an option.

6:.BR catdoc " [" -vlu8btawxV "] [" -m " 
9:.B -s
12:.B -d 
15:.B -f
60:.B -w
70:.B -a 
71:- shortcut for -f ascii. Produces ASCII text as output.
74:.B -b
84:.BI -d charset
92:.BI -f format
98:.B  -l
103:.BI -m number
105:.B -m 0
107:.B -w
109:.BI -s charset
119:.B -t
121:.B -f tex
127:.B -u
136:.B -w
144:.B -x
148:.B -v
152:.B -V
247:.B -s
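
A rough Python sketch of the pattern from the summary (" -[:alpha:]" or
\(aq-[:alpha:]).  The names OPTION_HYPHEN and escape_option_hyphens are
hypothetical, and the regular expression is my reading of the rule, not the
code of the actual test.

```python
import re

# A bare "-" that follows a space or the \(aq escape and precedes a letter.
# A "-" that is already escaped (preceded by a backslash) is not matched.
OPTION_HYPHEN = re.compile(r"(?:(?<= )|(?<=\\\(aq))-(?=[A-Za-z])")


def escape_option_hyphens(line):
    """Replace option-style hyphens in one line with the roff minus, \\-."""
    return OPTION_HYPHEN.sub(lambda match: "\\-", line)
```

Digits are left alone on purpose, so ".B -8" is not touched here; that line
is reported separately under test nr. 18.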

#####

Test nr. 34:

Add a space around "|" to increase readability

284:.BI "use_locale =" "(yes|no)"

#####

Test nr. 40:

Add a comma (or \&) after "e.g." and "i.e.", or use English words
(man-pages(7) [package "manpages"]).

261:with directory of executable file. Empty element in list (i.e. two
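
As a sketch only (the helper name is invented for this example), the comma
variant of the fix could be automated like this in Python.  It skips
abbreviations that are already followed by a comma or by a roff escape
such as \&.

```python
import re

# "e.g." or "i.e." not already followed by a comma or a backslash escape.
ABBREVIATION = re.compile(r"\b(e\.g\.|i\.e\.)(?![,\\])")


def add_comma_after_abbreviation(line):
    """Insert a comma after a bare "e.g." or "i.e." in one line."""
    return ABBREVIATION.sub(lambda match: match.group(1) + ",", line)
```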

#####

Test nr. 41:

Wrong distance between sentences; fix the spacing or protect the indicator.

1) Separate the sentences and subordinate clauses; each begins on a new
line.  See man-pages(7) [package "manpages"] and "info groff".

Or

2) Adjust space between sentences (two spaces),

3) or protect the indicator by adding "\&" after it.

The "indicator" is an "end-of-sentence character" (.!?).

29:tries to write correct headers for LaTeX tabular environment. Additional
36:missing from output charset. See CHARACTER SUBSTITUTION below 
48:processes its standard input unless it is terminal. It is unlikely that 
59:blank lines. This behavior can be turned of by 
61:switch. In 
71:- shortcut for -f ascii. Produces ASCII text as output.
75:- process broken MS-Word file. Normally,
77:of file is Microsoft OLE signature. If so, it processes file, otherwise
78:it just copies it to stdin. It is intended to use 
85:- specifies destination charset name. Charset file has format described in
95:comes with two output formats - ascii and tex. You can add your own if you
110:Specifies source charset. (one used in Word document), if Word document
111:doesn't contain UTF-16  text. When reading rtf documents, it is
113:specification. But it can be set wrong by Word (I've seen RTF documents
114:on Russian, where cp1252 was specified). In this case this option would
115:take precedence over charset, specified in the document. But
124:into appropriate control sequences. Separates table columns by 
129:of text (as some Word-97 documents). If catdoc fails to correct  Word document
133:- declares is Word document is 8 bit. Just in case that catdoc
137:disables word wrapping. By default 
141:are separated by blank line. With this option each paragraph is one
159: -  input and output. They are stored in plain text files in 
161:library directory. Character set files should contain two whitespace-separated
166:distribution includes some of these character sets. Additional character set
169:can be obtained from ftp.unicode.org. Charset files have
176:is distributed with Cyrillic charsets as default. If you are not
181:that Microsoft never uses ISO charsets. While letters in, say cp1252 are
183:lost, if you specify ISO-8859-1 as input charset. If you use cp1252,
191:1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)
193:2. Table cells within row are separated by ASCII Field Separator symbol
196:3. Table rows are separated by ASCII Record Separator (0x001E) 
198:4. All printable characters, including whitespace are represented with their
204:1. List of special characters is searched for given Unicode character.
208:2. If there is an equivalent in target character set, it is output.
210:3. Otherwise, replacement list is searched and, if there is multi-character
213:4. If all above fails, "Unknown char" symbol (question mark) is output.
223:library directory in files with prefix of format name. These files have
228:would be substituted instead of it. If string contain no whitespace it 
230:quotes. Usual backslash sequences like 
248:option specified. Consult configuration of nearby windows
252: Sets default output charset. You probably know, which one you use.
261:with directory of executable file. Empty element in list (i.e. two
288:system locale settings (if enabled at compile time). If automatic
291:locale charset is used instead. There are no automatic choice of input
298:fast-saves properly. Prints footnotes as separate paragraphs at the end of
299:file, instead of producing correct LaTeX commands. Cannot distinguish
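
A minimal Python sketch of a detector for this case, assuming that lines
starting with "." or "'" are roff requests and should be skipped.  The
function name is hypothetical.  It only reports candidates; choosing between
fixes 1) to 3) above still needs a human, for instance for "i.e." or for
numbered items like "1.".

```python
import re

# An end-of-sentence character followed by exactly one space and more text.
SINGLE_SPACE = re.compile(r"[.!?] (?=\S)")


def single_space_after_sentence(text):
    """Return 1-based numbers of text lines with a one-space sentence gap."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.startswith((".", "'")):
            continue  # roff request or macro line, not running text
        if SINGLE_SPACE.search(line):
            hits.append(number)
    return hits
```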

#####

Test nr. 42:

Split lines longer than 80 characters into two or more lines.
Appropriate break points are the ends of sentences and subordinate
clauses, that is, after punctuation marks.

catdoc.1: line 3        length 82
catdoc \- reads MS-Word file and puts its content as plain text on standard output

catdoc.1: line 89       length 89
.B catdoc library directory ( ${prefix}/lib/x86_64-linux-gnu/catdoc). By default, current
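
A short Python sketch that produces the same kind of report (line number
and length).  The function name and the default limit of 80 are assumptions
made for this example.

```python
def long_lines(text, limit=80):
    """Return (line number, length) pairs for lines longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.split("\n"), start=1)
        if len(line) > limit
    ]
```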

#####

Test nr. 44:

Use \(en for a dash (en-dash) between space characters, not a minus
(\-) or a hyphen (-), except in the NAME section.

71:- shortcut for -f ascii. Produces ASCII text as output.
75:- process broken MS-Word file. Normally,
85:- specifies destination charset name. Charset file has format described in
93:- specifies output format as described in CHARACTER SUBSTITUTION below.
95:comes with two output formats - ascii and tex. You can add your own if you
120:- shortcut for 
128:- declares that Word  document  contain  UNICODE   (UTF-16) representation
133:- declares is Word document is 8 bit. Just in case that catdoc
159: -  input and output. They are stored in plain text files in 
162:hexadecimal numbers - 8-bit code in character set and 16-bit Unicode code.
274:comes with two formats - 
276:but nothing prevents you from writing your own format (set two map files -
281:Character specification can have one of two form - character enclosed in
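
A possible Python sketch of the replacement, assuming a dash is a "-" that
stands at the start of a line or after a space and is followed by a space
or the end of the line.  The name to_en_dash is hypothetical, and the
function does not track sections, so the NAME section has to be excluded by
the caller.

```python
import re

# A "-" at line start or after a space, followed by a space or end of line.
SPACED_HYPHEN = re.compile(r"(?:^|(?<= ))-(?= |$)")


def to_en_dash(line):
    """Replace a free-standing hyphen in one line with the \\(en escape."""
    return SPACED_HYPHEN.sub(lambda match: "\\(en", line)
```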

#####

Test nr. 52:

Split a punctuation mark from a single argument of a two-font macro.

125:.BR &.
241:.BR ${HOME}/.catdocrc.
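
Sketched in Python, the fix for these two lines could look as follows.  The
list of two-font macros and the set of punctuation marks are assumptions
made for this example, and the function name is made up.

```python
import re

# A two-font macro with exactly one argument that ends in a punctuation mark.
TWO_FONT = re.compile(r"^(\.(?:BR|RB|BI|IB|IR|RI)) (\S+?)([.,;:!?])$")


def split_trailing_punctuation(line):
    """Move trailing punctuation into a second argument of its own."""
    return TWO_FONT.sub(r"\1 \2 \3", line)
```

Lines that already have two or more arguments, such as ".BR cat (1)", are
left unchanged.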

-- System Information:
Debian Release: buster/sid
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'proposed-updates'), (500, 'testing'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.110-3 (SMP w/2 CPU cores)
Locale: LANG=is_IS.iso88591, LC_CTYPE=is_IS.iso88591 (charmap=ISO-8859-1), LANGUAGE=is_IS.iso88591 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages catdoc depends on:
ii  libc6  2.27-5

catdoc recommends no packages.

Versions of packages catdoc suggests:
pn  tk | wish  <none>

-- no debconf information

-- 
Bjarni I. Gislason
--- catdoc.1    2017-11-05 22:48:29.000000000 +0000
+++ catdoc.1.new        2018-07-24 00:37:46.000000000 +0000
@@ -1,18 +1,16 @@
-.TH catdoc 1  "Version @catdoc_version@" "MS-Word reader"
+.TH catdoc 1 "Version @catdoc_version@" "MS-Word reader"
 .SH NAME
-catdoc \- reads MS-Word file and puts its content as plain text on standard output
+catdoc \- reads MS-Word file and puts its content as plain text on \
+standard output
 .SH SYNOPSIS
 
-.BR catdoc " [" -vlu8btawxV "] [" -m " 
-.IR number ] 
-[
-.B -s
-.IR charset ] 
-[
-.B -d 
-.IR charset ] 
-[ 
-.B -f
+.BR catdoc " [" \-vlu8btawxV "] [" \-m
+.IR number ]
+.RB [ \-s
+.IR charset ]
+.RB [ \-d
+.IR charset ]
+.RB [ \-f
 .IR output-format ]
 .I file
 
@@ -22,244 +20,253 @@ catdoc \- reads MS-Word file and puts it
 behaves much like
 .BR cat (1)
 but it reads MS-Word file and produces human-readable text on standard output.
-Optionally it can use 
+Optionally it can use
 .BR latex (1)
 escape sequences for characters which have special meaning for LaTeX.
 It also makes some effort to recognize MS-Word tables, although it never
-tries to write correct headers for LaTeX tabular environment. Additional
-output formats, such is HTML can be easily defined. 
+tries to write correct headers for LaTeX tabular environment.  Additional
+output formats, such is HTML can be easily defined.
 .PP
 .B catdoc
 doesn't attempt to extract formatting information other than tables from
 MS-Word document, so different output modes means mainly that different
 characters should be escaped and different ways used to represent characters,
-missing from output charset. See CHARACTER SUBSTITUTION below 
+missing from output charset.  See CHARACTER SUBSTITUTION below
 
 .PP
 .B catdoc
-uses internal 
+uses internal
 .BR unicode (4)
 representation of text, so it is able to convert texts when charset in
 source document doesn't match charset on target system.
 See CHARACTER SETS below.
 .PP
-If no file names supplied, 
+If no file names supplied,
+.B catdoc
+processes its standard input unless it is terminal.  It is unlikely that
+somebody could type Word document from keyboard, so if
 .B catdoc
-processes its standard input unless it is terminal. It is unlikely that 
-somebody could type Word document from keyboard, so if 
-.B catdoc 
 invoked without arguments and stdin is not redirected, it prints brief
-usage message and exits. 
+usage message and exits.
 Processing of standard input (even among other files) can be forced using
 dash '-' as file name.
 .PP
-By default, 
+By default,
 .B catdoc
 wraps lines which are more than 72 chars long and separates paragraphs by
-blank lines. This behavior can be turned of by 
-.B -w
-switch. In 
+blank lines.  This behavior can be turned of by
+.B \-w
+switch.  In
 .I wide
-mode 
-.B  catdoc prints each paragraph as one long line, suitable for import into
+mode
+.B catdoc prints each paragraph as one long line, suitable for import into
 word processors that perform word wrapping.
- 
+
 
 .SH OPTIONS
 .TP 8
-.B -a 
-- shortcut for -f ascii. Produces ASCII text as output.
-Separates table columns with TAB
+.B \-a
+\(en shortcut for \-f ascii.  Produces ASCII text as output.
+Separates table columns with TAB.
 .TP 8
-.B -b
-- process broken MS-Word file. Normally,
+.B \-b
+\(en process broken MS-Word file.  Normally,
 .B catdoc checks if first 8 bytes
-of file is Microsoft OLE signature. If so, it processes file, otherwise
-it just copies it to stdin. It is intended to use 
-.B catdoc 
-as filter for viewing all files with 
+of file is Microsoft OLE signature.  If so, it processes file, otherwise
+it just copies it to stdin.  It is intended to use
+.B catdoc
+as filter for viewing all files with
 .I .doc
 extension.
 .TP 8
-.BI -d charset
-- specifies destination charset name. Charset file has format described in
-CHARACTER SETS below and should have 
+.BI \-d charset
+\(en specifies destination charset name.  Charset file has format described in
+CHARACTER SETS below and should have
 .B .txt
-extension  and reside in 
-.B catdoc library directory ( ${prefix}/lib/x86_64-linux-gnu/catdoc). By default, current
-locale charset is used if langinfo support compiled in.
+extension and reside in
+.B catdoc library directory (${prefix}/lib/x86_64-linux-gnu/catdoc).
+By default, current locale charset is used
+if langinfo support is compiled in.
 .TP 8
-.BI -f format
-- specifies output format as described in CHARACTER SUBSTITUTION below.
+.BI \-f format
+\(en specifies output format as described in CHARACTER SUBSTITUTION below.
 .B catdoc
-comes with two output formats - ascii and tex. You can add your own if you
+comes with two output formats \(en ascii and tex.  You can add your own if you
 wish.
 .TP 8
-.B  -l
-Causes 
+.B \-l
+Causes
 .B catdoc
 to list names of available charsets to the stdout and exit successfully.
 .TP 8
-.BI -m number
-Specifies right margin for text  (default 72). 
-.B -m 0
+.BI \-m number
+Specifies right margin for text (default 72).
+.B \-m 0
 is equivalent to
-.B -w
+.B \-w
 .TP 8
-.BI -s charset
-Specifies source charset. (one used in Word document), if Word document
-doesn't contain UTF-16  text. When reading rtf documents, it is
+.BI \-s charset
+Specifies source charset, (one used in Word document), if Word document
+doesn't contain UTF-16 text.  When reading rtf documents, it is
 typically not necessary, because rtf documents contain ansicpg
-specification. But it can be set wrong by Word (I've seen RTF documents
-on Russian, where cp1252 was specified). In this case this option would
-take precedence over charset, specified in the document. But
+specification.  But it can be set wrong by Word (I've seen RTF documents
+on Russian, where cp1252 was specified).  In this case this option would
+take precedence over charset, specified in the document.  But
 source_charset statement in the configuration file have less priority
 than charset in the document.
 .TP 8
-.B -t
-- shortcut for 
-.B -f tex
- converts all printable chars, which have special meaning for 
+.B \-t
+\(en shortcut for
+.B \-f tex.
+Converts all printable chars, which have special meaning for
 .BR LaTeX (1)
-into appropriate control sequences. Separates table columns by 
-.BR &.
+into appropriate control sequences.  Separates table columns by
+.BR & .
 .TP 8
-.B -u
-- declares that Word  document  contain  UNICODE   (UTF-16) representation
-of text (as some Word-97 documents). If catdoc fails to correct  Word document
-with  default charset,   try    this  option.
+.B \-u
+\(en declares that Word document contain UNICODE (UTF-16) representation
+of text (as some Word-97 documents).  If catdoc fails to correct Word document
+with default charset, try this option.
 .TP 8
-.B -8
-- declares is Word document is 8 bit. Just in case that catdoc
- recognizes file format incorrectly.
+.B \-8
+\(en declares that Word document is 8 bit.  Just in case that catdoc
+recognizes file format incorrectly.
 .TP 8
-.B -w
-disables word wrapping. By default 
+.B \-w
+disables word wrapping.  By default
 .B catdoc
-output is split into lines not longer than 72 (or  number, specified by
--m  option)   characters and paragraphs
-are separated by blank line. With this option each paragraph is one
-long line. 
+output is split into lines not longer than 72 (or number, specified by
+\-m option) characters and paragraphs
+are separated by blank line.  With this option each paragraph is one
+long line.
 .TP 8
-.B -x
-causes catdoc to output unknown UNICODE character as \\xNNNN, instead
+.B \-x
+causes catdoc to output unknown UNICODE character as \exNNNN, instead
 of question marks.
 .TP 8
-.B -v
+.B \-v
 causes catdoc to print some useless information about word document
 structure to stdout before actual start of text.
 .TP 8
-.B -V
+.B \-V
 outputs catdoc version
 
 .SH CHARACTER SETS
-When processing MS-Word file 
+When processing MS-Word file
 .B catdoc
 uses information about two character sets, typically different
- -  input and output. They are stored in plain text files in 
+\(en input and output.  They are stored in plain text files in
 .B catdoc
-library directory. Character set files should contain two whitespace-separated
-hexadecimal numbers - 8-bit code in character set and 16-bit Unicode code.
+library directory.  Character set files should contain two whitespace-separated
+hexadecimal numbers \(en 8-bit code in character set and 16-bit Unicode code.
 Anything from hash mark to end of line is ignored, as well as blank lines.
 
-.B catdoc 
-distribution includes some of these character sets. Additional character set
-definitions, directly usable by 
-.B catdoc 
-can be obtained from ftp.unicode.org. Charset files have
+.B catdoc
+distribution includes some of these character sets.  Additional character set
+definitions, directly usable by
+.B catdoc
+can be obtained from ftp.unicode.org.  Charset files have
 .B .txt
 suffix, which shouldn't be specified in command-line or configuration
-files.  
+files.
 .PP
 Note that
-.B catdoc 
-is distributed with Cyrillic charsets as default. If you are not
-Russian, you probably don't want it, an should reconfigure catdoc at 
+.B catdoc
+is distributed with Cyrillic charsets as default.  If you are not
+Russian, you probably don't want it, an should reconfigure catdoc at
 compile time or in runtime configuration file.
 .PP
 When dealing with documents with charsets other than default, remember
-that Microsoft never uses ISO charsets. While letters in, say cp1252 are
+that Microsoft never uses ISO charsets.  While letters in, say cp1252 are
 at the same position as in ISO-8859-1, some punctuation signs would be
-lost, if you specify ISO-8859-1 as input charset. If you use cp1252,
+lost, if you specify ISO-8859-1 as input charset.  If you use cp1252,
 catdoc would deal with those signs as described in CHARACTER
 SUBSTITUTION below.
 
-.SH CHARACTER SUBSTITUTION 
+.SH CHARACTER SUBSTITUTION
 .B catdoc
-converts  MS-Word file into following internal Unicode representation:
+converts MS-Word file into following internal Unicode representation:
 .TP 4
-1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)
+1.
+Paragraphs are separated by ASCII Line Feed symbol (0x000A)
 .TP 4
-2. Table cells within row are separated by ASCII Field Separator symbol
+2.
+Table cells within row are separated by ASCII Field Separator symbol
 (0x001C)
 .TP 4
-3. Table rows are separated by ASCII Record Separator (0x001E) 
+3.
+Table rows are separated by ASCII Record Separator (0x001E)
 .TP 4
-4. All printable characters, including whitespace are represented with their
+4.
+All printable characters, including whitespace are represented with their
 respective UNICODE codes.
-.PP 
+.PP
 This UNICODE representation is subsequently converted into 8-bit text in
 target character set using following four-step algorithm:
 .TP 4
-1. List of special characters is searched for given Unicode character.
+1.
+List of special characters is searched for given Unicode character.
 If found, then appropriate multi-character sequence is output instead of
-character. 
+character.
 .TP 4
-2. If there is an equivalent in target character set, it is output.
+2.
+If there is an equivalent in target character set, it is output.
 .TP 4
-3. Otherwise, replacement list is searched and, if there is multi-character
+3.
+Otherwise, replacement list is searched and, if there is multi-character
 substitution for this UNICODE char, it is output.
 .TP 4
-4. If all above fails, "Unknown char" symbol (question mark) is output.
+4.
+If all above fails, "Unknown char" symbol (question mark) is output.
 .PP
 Lists of special characters and list of substitution are character
 set-independent, because special chars should be escaped regardless of their
-existence in target character set  (usually, they are parts of US-ASCII, and
+existence in target character set (usually, they are parts of US-ASCII, and
 therefore exist in any character set) and replacement list is searched only
 for those characters, which are not found in target character set.
 .PP
 These lists are stored in
-.B catdoc 
-library directory in files with prefix of format name. These files have
+.B catdoc
+library directory in files with prefix of format name.  These files have
 following format:
 .PP
 Each line can be either comment (starting with hash mark) or contain
 hexadecimal UNICODE value, separated by whitespace from string, which
-would be substituted instead of it. If string contain no whitespace it 
+would be substituted instead of it.  If string contain no whitespace it
 can be used as is, otherwise it should be enclosed in single or double
-quotes. Usual backslash sequences like 
+quotes.  Usual backslash sequences like
 .IR '\en' , '\et'
 can be used in these string.
 
 
 .SH RUNTIME CONFIGURATION
 Upon startup catdoc reads its system-wide configuration file (
-.B catdocrc in 
+.B catdocrc in
 .B catdoc
 library directory) and then
 user-specific configuration file
-.BR ${HOME}/.catdocrc.
+.BR ${HOME}/.catdocrc .
 .PP
 These files can contain following directives:
 .TP 8
 .BI "source_charset = " charset-name
-Sets default source charset, which would be used if no 
-.B -s
-option specified. Consult configuration of nearby windows
+Sets default source charset, which would be used if no
+.B \-s
+option specified.  Consult configuration of nearby windows
 workstation to find one you need.
 .TP 8
-.BI "target_charset = "  charset-name
- Sets default output charset. You probably know, which one you use.
+.BI "target_charset = " charset-name
+Sets default output charset.  You probably know, which one you use.
 .TP 8
-.BI "charset_path = "  directory-list
+.BI "charset_path = " directory-list
 colon-separated list of directories, which are searched for charset files.
 This allows you to install additional charsets in your home directory.
 If first directory component of path is ~ it is replaced by contents of
-.B HOME 
+.B HOME
 environment variable.
 On MS-DOS platform, if directory name starts with %s, it is replaced
-with directory of executable file. Empty element in list (i.e. two
-consequitve colons) is considered current directory.
+with directory of executable file.  Empty element in list (i.e., two
+consequitive colons) is considered current directory.
 .TP 8
 .BI "map_path = " directory-list
 colon-separated list of directories, which are searched for special character
@@ -271,32 +278,32 @@ are applied.
 .BI "format = " "format name"
 Output format which would be used by default.
 .B catdoc
-comes with two formats - 
+comes with two formats \(en
 .BR ascii " and " tex
-but nothing prevents you from writing your own format (set two map files -
+but nothing prevents you from writing your own format (set two map files \(en
 special character map and replacement map).
 .TP 8
 .BI "unknown_char = " "character specification"
 sets character to output instead of unknown Unicode character (default '?')
-Character specification can have one of two form - character enclosed in
+Character specification can have one of two form \(en character enclosed in
 single quotes or hexadecimal code.
 .TP 8
-.BI "use_locale =" "(yes|no)"
-Enables or disables automatic selection of output charset (default 
+.BI "use_locale =" "(yes\^|\^no)"
+Enables or disables automatic selection of output charset (default
 .BR yes ),
- based on
-system locale settings (if enabled at compile time). If automatic
+based on
+system locale settings (if enabled at compile time).  If automatic
 detection is enabled, than output charset settings in the configuration
 files (but not in the command line) are ignored, and current system
-locale charset is used instead. There are no automatic choice of input
+locale charset is used instead.  There are no automatic choice of input
 charset, based of locale language, because most modern Word files (since
 Word 97) are Unicode anyway
 
 .SH BUGS
 
 Doesn't handle
-fast-saves properly. Prints footnotes as separate paragraphs at the end of
-file, instead of producing correct LaTeX commands. Cannot distinguish
+fast-saves properly.  Prints footnotes as separate paragraphs at the end of
+file, instead of producing correct LaTeX commands.  Cannot distinguish
 between empty table cell and end of table row.
 
 
