Package: catdoc
Version: 1:0.95-4.1
Severity: minor
Tags: patch

The patch is in the attachment.

Summary:

Remove space at end of lines.
Fix warnings from test-groff.
Change - to \- if it shall be printed as a minus.
Change \\ to \e to print the escape character.
Change a HYPHEN-MINUS (octal code 055, hex 2D) to a dash (minus) if it matches " -[:alpha:]" or \(aq-[:alpha:] (for options).
Add a small space around "|" to increase readability.
Add a comma (or \&) after "e.g." and "i.e.", or use English words.
Increase space between sentences to two space characters.
Split long lines (> 80).
Use \(en for a dash where appropriate.
Split a punctuation from a single argument for a two-fonts macro

### Details:

Input file is catdoc.1

mandoc: catdoc.1:6:38: STYLE: unterminated quoted argument
mandoc: catdoc.1:6:40: STYLE: whitespace at end of input line

Many such lines

#######

Test nr. 2:

Enable and fix warnings from 'test-groff'.

<catdoc.1>:125 (macro BR): only 1 argument, but more are expected
<catdoc.1>:241 (macro BR): only 1 argument, but more are expected

Output is from: test-groff -b -e -mandoc -T utf8 -rF0 -t -w w -z

[ "test-groff" is a developmental version of "groff" ]

####

Test nr. 18:

Change - to \- if it shall be printed as a minus sign.

132:.B -8

#####

Test nr. 20:

Use "\e" to print the escape character instead of "\\" (which gets interpreted in copy mode).

145:causes catdoc to output unknown UNICODE character as \\xNNNN, instead

#####

Test nr. 25:

Change a HYPHEN-MINUS (octal code 055, hex 2D) to a minus (\-), if in front of a name for an option.

6:.BR catdoc " [" -vlu8btawxV "] [" -m "
9:.B -s
12:.B -d
15:.B -f
60:.B -w
70:.B -a
71:- shortcut for -f ascii. Produces ASCII text as output.
74:.B -b
84:.BI -d charset
92:.BI -f format
98:.B -l
103:.BI -m number
105:.B -m 0
107:.B -w
109:.BI -s charset
119:.B -t
121:.B -f tex
127:.B -u
136:.B -w
144:.B -x
148:.B -v
152:.B -V
247:.B -s

#####

Test nr. 34:

Add a space around "|" to increase readability

284:.BI "use_locale =" "(yes|no)"

#####

Test nr. 40:

Add a comma (or \&) after "e.g."
and "i.e.", or use English words (man-pages(7) [package "manpages"]).

261:with directory of executable file. Empty element in list (i.e. two

#####

Test nr. 41:

Wrong distance between sentences or protect the indicator.

1) Separate the sentences and subordinate clauses; each begins on a new line.
See man-pages(7) [package "manpages"] and "info groff".

Or

2) Adjust space between sentences (two spaces),

3) or protect the indicator by adding "\&" after it.

The "indicator" is an "end-of-sentence character" (.!?).

29:tries to write correct headers for LaTeX tabular environment. Additional
36:missing from output charset. See CHARACTER SUBSTITUTION below
48:processes its standard input unless it is terminal. It is unlikely that
59:blank lines. This behavior can be turned of by
61:switch. In
71:- shortcut for -f ascii. Produces ASCII text as output.
75:- process broken MS-Word file. Normally,
77:of file is Microsoft OLE signature. If so, it processes file, otherwise
78:it just copies it to stdin. It is intended to use
85:- specifies destination charset name. Charset file has format described in
95:comes with two output formats - ascii and tex. You can add your own if you
110:Specifies source charset. (one used in Word document), if Word document
111:doesn't contain UTF-16 text. When reading rtf documents, it is
113:specification. But it can be set wrong by Word (I've seen RTF documents
114:on Russian, where cp1252 was specified). In this case this option would
115:take precedence over charset, specified in the document. But
124:into appropriate control sequences. Separates table columns by
129:of text (as some Word-97 documents). If catdoc fails to correct Word document
133:- declares is Word document is 8 bit. Just in case that catdoc
137:disables word wrapping. By default
141:are separated by blank line. With this option each paragraph is one
159: - input and output. They are stored in plain text files in
161:library directory. Character set files should contain two whitespace-separated
166:distribution includes some of these character sets. Additional character set
169:can be obtained from ftp.unicode.org. Charset files have
176:is distributed with Cyrillic charsets as default. If you are not
181:that Microsoft never uses ISO charsets. While letters in, say cp1252 are
183:lost, if you specify ISO-8859-1 as input charset. If you use cp1252,
191:1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)
193:2. Table cells within row are separated by ASCII Field Separator symbol
196:3. Table rows are separated by ASCII Record Separator (0x001E)
198:4. All printable characters, including whitespace are represented with their
204:1. List of special characters is searched for given Unicode character.
208:2. If there is an equivalent in target character set, it is output.
210:3. Otherwise, replacement list is searched and, if there is multi-character
213:4. If all above fails, "Unknown char" symbol (question mark) is output.
223:library directory in files with prefix of format name. These files have
228:would be substituted instead of it. If string contain no whitespace it
230:quotes. Usual backslash sequences like
248:option specified. Consult configuration of nearby windows
252: Sets default output charset. You probably know, which one you use.
261:with directory of executable file. Empty element in list (i.e. two
288:system locale settings (if enabled at compile time). If automatic
291:locale charset is used instead. There are no automatic choice of input
298:fast-saves properly. Prints footnotes as separate paragraphs at the end of
299:file, instead of producing correct LaTeX commands. Cannot distinguish

#####

Test nr. 42:

Split lines longer than 80 characters into two or more lines.
Appropriate break points are the end of a sentence and a subordinate clause; after punctuation marks.

catdoc.1: line 3 length 82
catdoc \- reads MS-Word file and puts its content as plain text on standard output

catdoc.1: line 89 length 89
.B catdoc library directory ( ${prefix}/lib/x86_64-linux-gnu/catdoc). By default, current

#####

Test nr. 44:

Use \(en for a dash (en-dash) between space characters, not a minus (\-) or a hyphen (-), except in the NAME section.

71:- shortcut for -f ascii. Produces ASCII text as output.
75:- process broken MS-Word file. Normally,
85:- specifies destination charset name. Charset file has format described in
93:- specifies output format as described in CHARACTER SUBSTITUTION below.
95:comes with two output formats - ascii and tex. You can add your own if you
120:- shortcut for
128:- declares that Word document contain UNICODE (UTF-16) representation
133:- declares is Word document is 8 bit. Just in case that catdoc
159: - input and output. They are stored in plain text files in
162:hexadecimal numbers - 8-bit code in character set and 16-bit Unicode code.
274:comes with two formats -
276:but nothing prevents you from writing your own format (set two map files -
281:Character specification can have one of two form - character enclosed in

#####

Test nr. 52:

Split a punctuation from a single argument for a two-fonts macro

125:.BR &.
241:.BR ${HOME}/.catdocrc.

-- System Information:
Debian Release: buster/sid
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'proposed-updates'), (500, 'testing'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.110-3 (SMP w/2 CPU cores)
Locale: LANG=is_IS.iso88591, LC_CTYPE=is_IS.iso88591 (charmap=ISO-8859-1), LANGUAGE=is_IS.iso88591 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages catdoc depends on:
ii  libc6  2.27-5

catdoc recommends no packages.

Versions of packages catdoc suggests:
pn  tk | wish  <none>

-- no debconf information

--
Bjarni I. Gislason

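For readers who want to reproduce a few of the checks listed above without the test suite used for this report, here is a minimal Python sketch. It is not that test suite; it only approximates three of the checks (space at end of line, lines longer than 80 characters, and an unescaped HYPHEN-MINUS matching " -[:alpha:]" or \(aq-[:alpha:]). The names MAX_LEN, UNESCAPED_OPTION, check_line and check_text are illustrative assumptions, not part of any existing tool.

```python
import re

# Limit named in the report ("Split long lines (> 80)").
MAX_LEN = 80

# Hypothetical pattern: an unescaped HYPHEN-MINUS in front of an option
# letter, i.e. " -x" or "\(aq-x" as described in the summary.
UNESCAPED_OPTION = re.compile(r"(?: |\\\(aq)-[A-Za-z]")


def check_line(line):
    """Return a list of issue labels found on one man page source line."""
    issues = []
    if line != line.rstrip():
        issues.append("trailing whitespace")
    if len(line) > MAX_LEN:
        issues.append("line longer than 80 characters")
    if UNESCAPED_OPTION.search(line):
        issues.append("unescaped hyphen before option")
    return issues


def check_text(text):
    """Yield (line number, issue) pairs for every issue found in text."""
    for number, line in enumerate(text.split("\n"), start=1):
        for issue in check_line(line):
            yield number, issue


if __name__ == "__main__":
    # Two sample lines in the style of the original catdoc.1 source.
    sample = ".B -s\nOptionally it can use \n"
    for number, issue in check_text(sample):
        print(f"{number}: {issue}")
```

The sketch works on plain strings so that it can be tried on any man page source; it does not parse roff and will also flag hyphens that happen to follow a space in running text.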
--- catdoc.1 2017-11-05 22:48:29.000000000 +0000 +++ catdoc.1.new 2018-07-24 00:37:46.000000000 +0000 @@ -1,18 +1,16 @@ -.TH catdoc 1 "Version @catdoc_version@" "MS-Word reader" +.TH catdoc 1 "Version @catdoc_version@" "MS-Word reader" .SH NAME -catdoc \- reads MS-Word file and puts its content as plain text on standard output +catdoc \- reads MS-Word file and puts its content as plain text on \ +standard output .SH SYNOPSIS -.BR catdoc " [" -vlu8btawxV "] [" -m " -.IR number ] -[ -.B -s -.IR charset ] -[ -.B -d -.IR charset ] -[ -.B -f +.BR catdoc " [" \-vlu8btawxV "] [" \-m +.IR number ] +.RB [ \-s +.IR charset ] +.RB [ \-d +.IR charset ] +.RB [ \-f .IR output-format ] .I file @@ -22,244 +20,253 @@ catdoc \- reads MS-Word file and puts it behaves much like .BR cat (1) but it reads MS-Word file and produces human-readable text on standard output. -Optionally it can use +Optionally it can use .BR latex (1) escape sequences for characters which have special meaning for LaTeX. It also makes some effort to recognize MS-Word tables, although it never -tries to write correct headers for LaTeX tabular environment. Additional -output formats, such is HTML can be easily defined. +tries to write correct headers for LaTeX tabular environment. Additional +output formats, such is HTML can be easily defined. .PP .B catdoc doesn't attempt to extract formatting information other than tables from MS-Word document, so different output modes means mainly that different characters should be escaped and different ways used to represent characters, -missing from output charset. See CHARACTER SUBSTITUTION below +missing from output charset. See CHARACTER SUBSTITUTION below .PP .B catdoc -uses internal +uses internal .BR unicode (4) representation of text, so it is able to convert texts when charset in source document doesn't match charset on target system. See CHARACTER SETS below. 
.PP -If no file names supplied, +If no file names supplied, +.B catdoc +processes its standard input unless it is terminal. It is unlikely that +somebody could type Word document from keyboard, so if .B catdoc -processes its standard input unless it is terminal. It is unlikely that -somebody could type Word document from keyboard, so if -.B catdoc invoked without arguments and stdin is not redirected, it prints brief -usage message and exits. +usage message and exits. Processing of standard input (even among other files) can be forced using dash '-' as file name. .PP -By default, +By default, .B catdoc wraps lines which are more than 72 chars long and separates paragraphs by -blank lines. This behavior can be turned of by -.B -w -switch. In +blank lines. This behavior can be turned of by +.B \-w +switch. In .I wide -mode -.B catdoc prints each paragraph as one long line, suitable for import into +mode +.B catdoc prints each paragraph as one long line, suitable for import into word processors that perform word wrapping. - + .SH OPTIONS .TP 8 -.B -a -- shortcut for -f ascii. Produces ASCII text as output. -Separates table columns with TAB +.B \-a +\(en shortcut for \-f ascii. Produces ASCII text as output. +Separates table columns with TAB. .TP 8 -.B -b -- process broken MS-Word file. Normally, +.B \-b +\(en process broken MS-Word file. Normally, .B catdoc checks if first 8 bytes -of file is Microsoft OLE signature. If so, it processes file, otherwise -it just copies it to stdin. It is intended to use -.B catdoc -as filter for viewing all files with +of file is Microsoft OLE signature. If so, it processes file, otherwise +it just copies it to stdin. It is intended to use +.B catdoc +as filter for viewing all files with .I .doc extension. .TP 8 -.BI -d charset -- specifies destination charset name. Charset file has format described in -CHARACTER SETS below and should have +.BI \-d charset +\(en specifies destination charset name. 
Charset file has format described in +CHARACTER SETS below and should have .B .txt -extension and reside in -.B catdoc library directory ( ${prefix}/lib/x86_64-linux-gnu/catdoc). By default, current -locale charset is used if langinfo support compiled in. +extension and reside in +.B catdoc library directory (${prefix}/lib/x86_64-linux-gnu/catdoc). +By default, current locale charset is used +if langinfo support is compiled in. .TP 8 -.BI -f format -- specifies output format as described in CHARACTER SUBSTITUTION below. +.BI \-f format +\(en specifies output format as described in CHARACTER SUBSTITUTION below. .B catdoc -comes with two output formats - ascii and tex. You can add your own if you +comes with two output formats \(en ascii and tex. You can add your own if you wish. .TP 8 -.B -l -Causes +.B \-l +Causes .B catdoc to list names of available charsets to the stdout and exit successfully. .TP 8 -.BI -m number -Specifies right margin for text (default 72). -.B -m 0 +.BI \-m number +Specifies right margin for text (default 72). +.B \-m 0 is equivalent to -.B -w +.B \-w .TP 8 -.BI -s charset -Specifies source charset. (one used in Word document), if Word document -doesn't contain UTF-16 text. When reading rtf documents, it is +.BI \-s charset +Specifies source charset, (one used in Word document), if Word document +doesn't contain UTF-16 text. When reading rtf documents, it is typically not necessary, because rtf documents contain ansicpg -specification. But it can be set wrong by Word (I've seen RTF documents -on Russian, where cp1252 was specified). In this case this option would -take precedence over charset, specified in the document. But +specification. But it can be set wrong by Word (I've seen RTF documents +on Russian, where cp1252 was specified). In this case this option would +take precedence over charset, specified in the document. But source_charset statement in the configuration file have less priority than charset in the document. 
.TP 8 -.B -t -- shortcut for -.B -f tex - converts all printable chars, which have special meaning for +.B \-t +\(en shortcut for +.B \-f tex. +Converts all printable chars, which have special meaning for .BR LaTeX (1) -into appropriate control sequences. Separates table columns by -.BR &. +into appropriate control sequences. Separates table columns by +.BR & . .TP 8 -.B -u -- declares that Word document contain UNICODE (UTF-16) representation -of text (as some Word-97 documents). If catdoc fails to correct Word document -with default charset, try this option. +.B \-u +\(en declares that Word document contain UNICODE (UTF-16) representation +of text (as some Word-97 documents). If catdoc fails to correct Word document +with default charset, try this option. .TP 8 -.B -8 -- declares is Word document is 8 bit. Just in case that catdoc - recognizes file format incorrectly. +.B \-8 +\(en declares that Word document is 8 bit. Just in case that catdoc +recognizes file format incorrectly. .TP 8 -.B -w -disables word wrapping. By default +.B \-w +disables word wrapping. By default .B catdoc -output is split into lines not longer than 72 (or number, specified by --m option) characters and paragraphs -are separated by blank line. With this option each paragraph is one -long line. +output is split into lines not longer than 72 (or number, specified by +\-m option) characters and paragraphs +are separated by blank line. With this option each paragraph is one +long line. .TP 8 -.B -x -causes catdoc to output unknown UNICODE character as \\xNNNN, instead +.B \-x +causes catdoc to output unknown UNICODE character as \exNNNN, instead of question marks. .TP 8 -.B -v +.B \-v causes catdoc to print some useless information about word document structure to stdout before actual start of text. 
.TP 8 -.B -V +.B \-V outputs catdoc version .SH CHARACTER SETS -When processing MS-Word file +When processing MS-Word file .B catdoc uses information about two character sets, typically different - - input and output. They are stored in plain text files in +\(en input and output. They are stored in plain text files in .B catdoc -library directory. Character set files should contain two whitespace-separated -hexadecimal numbers - 8-bit code in character set and 16-bit Unicode code. +library directory. Character set files should contain two whitespace-separated +hexadecimal numbers \(en 8-bit code in character set and 16-bit Unicode code. Anything from hash mark to end of line is ignored, as well as blank lines. -.B catdoc -distribution includes some of these character sets. Additional character set -definitions, directly usable by -.B catdoc -can be obtained from ftp.unicode.org. Charset files have +.B catdoc +distribution includes some of these character sets. Additional character set +definitions, directly usable by +.B catdoc +can be obtained from ftp.unicode.org. Charset files have .B .txt suffix, which shouldn't be specified in command-line or configuration -files. +files. .PP Note that -.B catdoc -is distributed with Cyrillic charsets as default. If you are not -Russian, you probably don't want it, an should reconfigure catdoc at +.B catdoc +is distributed with Cyrillic charsets as default. If you are not +Russian, you probably don't want it, an should reconfigure catdoc at compile time or in runtime configuration file. .PP When dealing with documents with charsets other than default, remember -that Microsoft never uses ISO charsets. While letters in, say cp1252 are +that Microsoft never uses ISO charsets. While letters in, say cp1252 are at the same position as in ISO-8859-1, some punctuation signs would be -lost, if you specify ISO-8859-1 as input charset. If you use cp1252, +lost, if you specify ISO-8859-1 as input charset. 
If you use cp1252, catdoc would deal with those signs as described in CHARACTER SUBSTITUTION below. -.SH CHARACTER SUBSTITUTION +.SH CHARACTER SUBSTITUTION .B catdoc -converts MS-Word file into following internal Unicode representation: +converts MS-Word file into following internal Unicode representation: .TP 4 -1. Paragraphs are separated by ASCII Line Feed symbol (0x000A) +1. +Paragraphs are separated by ASCII Line Feed symbol (0x000A) .TP 4 -2. Table cells within row are separated by ASCII Field Separator symbol +2. +Table cells within row are separated by ASCII Field Separator symbol (0x001C) .TP 4 -3. Table rows are separated by ASCII Record Separator (0x001E) +3. +Table rows are separated by ASCII Record Separator (0x001E) .TP 4 -4. All printable characters, including whitespace are represented with their +4. +All printable characters, including whitespace are represented with their respective UNICODE codes. -.PP +.PP This UNICODE representation is subsequently converted into 8-bit text in target character set using following four-step algorithm: .TP 4 -1. List of special characters is searched for given Unicode character. +1. +List of special characters is searched for given Unicode character. If found, then appropriate multi-character sequence is output instead of -character. +character. .TP 4 -2. If there is an equivalent in target character set, it is output. +2. +If there is an equivalent in target character set, it is output. .TP 4 -3. Otherwise, replacement list is searched and, if there is multi-character +3. +Otherwise, replacement list is searched and, if there is multi-character substitution for this UNICODE char, it is output. .TP 4 -4. If all above fails, "Unknown char" symbol (question mark) is output. +4. +If all above fails, "Unknown char" symbol (question mark) is output. 
.PP Lists of special characters and list of substitution are character set-independent, because special chars should be escaped regardless of their -existence in target character set (usually, they are parts of US-ASCII, and +existence in target character set (usually, they are parts of US-ASCII, and therefore exist in any character set) and replacement list is searched only for those characters, which are not found in target character set. .PP These lists are stored in -.B catdoc -library directory in files with prefix of format name. These files have +.B catdoc +library directory in files with prefix of format name. These files have following format: .PP Each line can be either comment (starting with hash mark) or contain hexadecimal UNICODE value, separated by whitespace from string, which -would be substituted instead of it. If string contain no whitespace it +would be substituted instead of it. If string contain no whitespace it can be used as is, otherwise it should be enclosed in single or double -quotes. Usual backslash sequences like +quotes. Usual backslash sequences like .IR '\en' , '\et' can be used in these string. .SH RUNTIME CONFIGURATION Upon startup catdoc reads its system-wide configuration file ( -.B catdocrc in +.B catdocrc in .B catdoc library directory) and then user-specific configuration file -.BR ${HOME}/.catdocrc. +.BR ${HOME}/.catdocrc . .PP These files can contain following directives: .TP 8 .BI "source_charset = " charset-name -Sets default source charset, which would be used if no -.B -s -option specified. Consult configuration of nearby windows +Sets default source charset, which would be used if no +.B \-s +option specified. Consult configuration of nearby windows workstation to find one you need. .TP 8 -.BI "target_charset = " charset-name - Sets default output charset. You probably know, which one you use. +.BI "target_charset = " charset-name +Sets default output charset. You probably know, which one you use. 
.TP 8 -.BI "charset_path = " directory-list +.BI "charset_path = " directory-list colon-separated list of directories, which are searched for charset files. This allows you to install additional charsets in your home directory. If first directory component of path is ~ it is replaced by contents of -.B HOME +.B HOME environment variable. On MS-DOS platform, if directory name starts with %s, it is replaced -with directory of executable file. Empty element in list (i.e. two -consequitve colons) is considered current directory. +with directory of executable file. Empty element in list (i.e., two +consequitive colons) is considered current directory. .TP 8 .BI "map_path = " directory-list colon-separated list of directories, which are searched for special character @@ -271,32 +278,32 @@ are applied. .BI "format = " "format name" Output format which would be used by default. .B catdoc -comes with two formats - +comes with two formats \(en .BR ascii " and " tex -but nothing prevents you from writing your own format (set two map files - +but nothing prevents you from writing your own format (set two map files \(en special character map and replacement map). .TP 8 .BI "unknown_char = " "character specification" sets character to output instead of unknown Unicode character (default '?') -Character specification can have one of two form - character enclosed in +Character specification can have one of two form \(en character enclosed in single quotes or hexadecimal code. .TP 8 -.BI "use_locale =" "(yes|no)" -Enables or disables automatic selection of output charset (default +.BI "use_locale =" "(yes\^|\^no)" +Enables or disables automatic selection of output charset (default .BR yes ), - based on -system locale settings (if enabled at compile time). If automatic +based on +system locale settings (if enabled at compile time). 
If automatic detection is enabled, than output charset settings in the configuration files (but not in the command line) are ignored, and current system -locale charset is used instead. There are no automatic choice of input +locale charset is used instead. There are no automatic choice of input charset, based of locale language, because most modern Word files (since Word 97) are Unicode anyway .SH BUGS Doesn't handle -fast-saves properly. Prints footnotes as separate paragraphs at the end of -file, instead of producing correct LaTeX commands. Cannot distinguish +fast-saves properly. Prints footnotes as separate paragraphs at the end of +file, instead of producing correct LaTeX commands. Cannot distinguish between empty table cell and end of table row.