On Wed, Oct 25, 2023 at 05:03:36AM -0500, G. Branden Robinson wrote:
> Hi Walter & Dave,
>
> At 2023-09-11T19:45:30+0200, Walter Alejandro Iglesias wrote:
> > If instead of sourcing hyphen.tr from my macros with .mso I source it
> > directly from the roff document with .so those error messages
> > disappear.
>
> As Dave mentioned, this is explained by soelim(1) not being run on the
> "macro sourced" file. As a rule, I think files to be read with the
> `mso` request should be in plain ASCII only. The whole point of a macro
> file suitable for general use is that it...gets used generally, which
> means that documents employing a variety of input encodings might employ
> it. You therefore should use the lowest common denominator character
> encoding for it: ASCII. (Strictly, ISO 646:1991-IRV.)
>
> That doesn't mean you have to do much more work or spend a lot of time
> staring at groff_char(7) and learning the special character identifiers
> for the upper half of ISO 8859-1. You can still have your macro sourced
> file in Latin-1; just run preconv over it stand-alone as a converter.
>
> $ printf '.ds aunt la t\355a\n' > family.mso.in
> $ preconv -e latin1 family.mso.in > family.mso
>
> Part of the preconv(1) man page is likely worth reviewing.
>
> iconv support
> [...]
>     The use of iconv means that characters in the input that encode
>     invalid code points for that encoding may be dropped from the
>     output stream or mapped to the Unicode replacement character
>     (U+FFFD). Compare the following examples using the input “café”
>     (note the “e” with an acute accent), which due to its short
>     length challenges inference of the encoding used.
>         printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
>         printf 'caf\351\n' | preconv -e us-ascii
>         printf 'caf\351\n' | preconv -e latin-1
>     The fate of the accented “e” differs in each case.
>     In the first, uchardet fails to detect an encoding (though the
>     library on your system may behave differently) and preconv falls
>     back to the locale settings, where octal 351 starts an incomplete
>     UTF‐8 sequence and results in the Unicode replacement character.
>     In the second, it is not a representable character in the declared
>     input encoding of US‐ASCII and is discarded by iconv. In the
>     last, it is correctly detected and mapped.
> [...]
> Limitations
>     preconv cannot perform any transformation on input that it cannot
>     see. Examples include files that are interpolated by
>     preprocessors that run subsequently, including soelim(1); files
>     included by troff itself through “so” and similar requests; and
>     string definitions passed to troff through its -d command‐line
>     option.
>
> Maybe I should add my admonition above about macro-sourced files to this
> man page.
>
> At 2023-09-12T11:16:58+0200, Walter Alejandro Iglesias wrote:
> > I cleaned up the quoted text a bit to make room for the following. Here
> > we go:
> >
> > $ uname -a
> > Linux bell 6.4.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.13-1
> > (2023-08-31) x86_64 GNU/Linux
> > $ groff --version | head -1
> > GNU groff version 1.23.0
> > $ mkdir test
> > $ cd test
> > $ cat << EOF > doc.tr
> > .mso list.tr
> > EOF
> > $ cat << EOF > list.tr
> > .hw a-hí
> > .hw a-ño
> > .hw ár-bol
> > .hw cu-brí-a
> > .hw e-té-re-o
> > .hw ca-mión
> > .hw ú-te-ro
> > .hw pin-güi-no
> > EOF
> > $ GROFF_TMAC_PATH=. nroff doc.tr
> > troff:./list.tr:1: error: expected ordinary or special character, got an
> > escaped '%'
> > troff:./list.tr:4: error: expected ordinary or special character, got an
> > escaped '%'
>
> This transcript isn't as useful as it could be, because it didn't
> disclose to me what character encoding was used for list.tr on the file
> system. Running the file(1) command on it and sharing that would help.

I think I said several times that list.tr is a UTF-8 file. And I
wouldn't trust file(1) on that.

> > As you see, from the UTF-8 chars used in Spanish (á, é, í, ó, ú, ü,
> > ñ), groff seems to only have problems with the 'í' in particular.
> > Let's try another test using preconv(1).
>
> preconv is probably using iconv(3) on your system ("preconv --version"
> will tell you). iconv's heuristics for guessing the encoding are opaque
> to groff (and to me).

On OpenBSD, preconv (1.22.4) is compiled without iconv. I had to
downgrade Devuan to stable, which comes with groff 1.22.4 and a preconv
compiled *with* iconv. I cannot reproduce the bug here. So this has
all the marks of a regression; in your place I'd try to figure out in
which patch between 1.22.4 and the current version it was introduced.
I know that my bug report isn't as helpful as it could be, but right now
I'm doing other things, sorry.

> > The errors remain. Finally, I told you that changing .mso request to
> > .so made the error messages disappear, that's because in my Makefile I
> > run soelim(1) before. Last test:
> >
> > $ cat << EOF > doc.tr
> > .hla es
> > .so list.tr \" notice here I changed the request
> > Ahí, el árbol nos cubría con su sombra.
> > Un pingüino pasaba caminando por la playa.
> > EOF
> > $ preconv -e UTF-8 doc.tr | nroff | cat -s
> > troff:./list.tr:1: error: expected ordinary or special character, got an
> > escaped '%'
> > troff:./list.tr:3: error: expected ordinary or special character, got an
> > escaped '%'
> > Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> > nando por la playa.
> > $ soelim doc.tr | preconv -e UTF-8 | nroff | cat -s
> > Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> > nando por la playa.
> >
> > This last command throws no error, that's because soelim(1) allows
> > preconv(1) to process the list.tr file.
>
> Right, I think that's the right strategy precisely.
> You can maintain the file you want to `mso` in version control in
> whatever character encoding is comfortable for you--I'd store it as an
> ".in" file and have make(1) run preconv(1) over it when constructing
> documents that use it.
>
> > Anyways. My doubt comes from the fact that so far (with groff 1.22.4
> > under OpenBSD) I haven't needed to preprocess that .hw list with
> > preconv,
>
> OpenBSD is notoriously minimalistic. You might see if `preconv
> --version` there reports use of iconv...except...uh, I think revealing
> that information is something I added _after_ the groff 1.22.4 release.

Answered above.

> So here's another paragraph from preconv(1) that might explain the
> behavior on OpenBSD.
>
> iconv support
>     While preconv recognizes all of the coding tags listed above, it
>     is capable on its own of interpreting only three encodings:
>     Latin‐1, code page 1047, and UTF‐8. If iconv support is
>     configured at compile time and available at run time, all others
>     are passed to iconv library functions, which may recognize many
>     additional encoding strings. The command “preconv -v” discloses
>     whether iconv support is configured.
>
> Unfortunately I don't know of an example of an encoding name that is a
> reliable test for iconv support being absent.
>
> > and that only the 'í' (iacute) triggers the error.
>
> I think this might be explained by iconv(3)'s heuristic approach.
>
> On my system, I confirmed that nothing crazy was going on with the
> following experiments.
>
> $ printf 'caf\351\n' | preconv -e latin1
> .lf 1 -
> caf\[u00E9]
> $ printf 'la t\355a\n' | preconv -e latin1 | nroff | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff -K latin1 | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff | head -n 1
> la tía
>
> At 2023-10-05T10:45:32+0200, Walter Alejandro Iglesias wrote:
> > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > here) ...
> >
> > $ preconv -e utf8 list_in_latin1.tr
> >
> > ...
> > *all* non ASCII characters in the output are replaced by \[uFFFD].
>
> Yes, because the `-e` flag _describes the character encoding of the
> input_.
>
> Description
>     preconv reads each file, converts its encoded characters to a
>     form troff(1) can interpret, and sends the result to the standard
>     output stream.
> [...]
> Options
> [...]
>     -e encoding
>         Skip detection and assume encoding; see groff’s -K option.
>
> Do not try to tell preconv the desired character encoding of the
> _output_; that's not its job. Its job is to normalize the input so that
> GNU troff(1) can read it.
>
> The character encoding of the output is inapplicable to GNU troff(1)
> itself; it, like all device-independent troffs, writes an ASCII-encoded
> plain text file. An output driver like grotty(1) translates troff(1)
> output into whatever is appropriate for the device, which is why groff's
> terminal output devices are named things like "ascii", "latin1" and
> "utf8".
>
> At 2023-10-12T16:46:07-0500, Dave Kemper wrote:
> > On 10/5/23, Walter Alejandro Iglesias <w...@roquesor.com> wrote:
> > > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > > here) ...
> > >
> > > $ preconv -e utf8 list_in_latin1.tr
> > >
> > > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
> >
> > Yes, this would be expected to not work. preconv's "-e" option
> > specifies the *input* encoding. So if the input file is in Latin-1,
> > but you tell preconv that it's in UTF-8, you'd expect things to go
> > awry.
>
> Right.
>
> > But that's not the full explanation: *all* Latin-1 characters are
> > multiple bytes when encoded as UTF-8.
>
> Strictly, Latin-1 is an 8-bit character encoding. You might say here
> "all characters from the Unicode Latin-1 extension block" instead.
>
> Ya know, if you're a stickler.
>
> > So if iacute (Latin-1 0xED) is misread in the way Bjarni describes,
> > the same should happen to all the other Latin-1 characters as well.
> > The fact groff is treating one Latin-1 character differently from the
> > others carries the whiff of a bug.
>
> I'm prepared to chalk this up to iconv heuristic conversion in the
> absence of other information. See my attempted reproducers above.
>
> Regards,
> Branden

--
Walter
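
For reference, the strategy agreed on above (keep the macro file in a comfortable encoding as an ".in" file and run preconv(1) over it before it is read with `mso`) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `is_utf8` and `build_macro_file` and the file name `list.tr.in` are hypothetical and do not come from the thread, and the UTF-8 check uses iconv(1) rather than file(1). The check only shows that the bytes decode as UTF-8; a pure-ASCII file passes too.

```shell
#!/bin/sh
# Sketch only: function and file names are placeholders, not from the thread.

# Succeed if "$1" decodes cleanly as UTF-8.
# iconv(1) exits non-zero when it meets an invalid input sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Convert a UTF-8 macro source ($1) into the ASCII-only form ($2) that
# can safely be read with the mso request. The -e option declares the
# encoding of the *input*, so refuse input that is not valid UTF-8.
build_macro_file() {
    src=$1
    dst=$2
    if ! is_utf8 "$src"; then
        echo "$src: not valid UTF-8, refusing to convert" >&2
        return 1
    fi
    preconv -e UTF-8 "$src" > "$dst"
}
```

With these definitions, something like `build_macro_file list.tr.in list.tr` would run before `GROFF_TMAC_PATH=. nroff doc.tr`, so that troff never sees raw UTF-8 bytes in the macro file.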