At 2021-05-15T23:39:31+0200, Oliver Corff wrote:
> Dear All,
>
> I try to use the correct abbreviation for the former Czechoslovak
> Socialist Republic, which is ČSSR (C + hacek, caron, wedge).  The
> first attempt (entering Unicode 0x010C directly, leaving everything
> to preconv(1)) did not work.  Then I consulted groff_char(7), but
> there is no predefined \[vC], only \[vS] etc. for base letters s, S,
> z, and Z.  No C!  I keep scratching my head.
>
> None of the other suggested notations (like \[u0043_030C]) work out
> of the box (see groff(7)).
As noted in my previous reply just a little while ago[1], I believe
this is because the font does not have coverage for one of the
constituent elements of this character composition sequence, so it
can't render the result accurately.

> The GNU groff online manual
> (https://www.gnu.org/software/groff/manual/groff.html) offers an
> escape route with the following request:
>
> .AM
>
> A Berkeley extension which enables extended accent marks.

This feature is specific to the ms macro package.  You have to be
using the ms macros to use it.  You could certainly crib the
implementation (respecting the GPL, of course ;-) ) for other sorts of
documents; I will discuss the limitations of such a course at some
length below.

> For months, I thought this was a font setup question, and
> experimented with numerous fonts, to no avail.  Only today I
> discovered the .AM request.  It is contained in the online manual
> but I could not locate it in any of the groff man pages (perhaps a
> version issue?  I currently run 1.22.4).

It is documented in the 1.22.4 version of groff_ms(7), but it's buried
pretty deeply and mentioned only in passing.

  DIFFERENCES FROM troff ms
  [...]
    Text Settings
    [...]
    Improved accent marks (as originally defined in Berkeley's ms
    version) are available by specifying the AM macro at the beginning
    of your document.  You can place an accent over most characters by
    specifying the string defining the accent directly after the
    character.  For example, n\*~ produces an n with a tilde over it.

I believe the reason AM, and the earlier AT&T ms accent mark placement
mechanism documented in Lesk's 1978 "Typing Documents on the UNIX
System" paper, aren't documented more prominently is that they were
showing their age even by the 1990s.  They are smarter than dumb
overstriking, but they aren't adaptable to the stacking of multiple
accent marks in the same general direction relative to the character.
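To make that concrete before going further, here is a minimal sketch
of what Oliver's ČSSR example might look like with the Berkeley
mechanism.  This assumes the ms macro package and a typesetter output
device; I haven't tested it against every driver:

```roff
.\" Sketch: Berkeley-style accent marks in an ms document.
.\" .AM must be called before any accented text; with .AM, the accent
.\" string FOLLOWS the base letter, and "v" is the caron (hacek).
.AM
.PP
The abbreviation for the former Czechoslovak Socialist Republic is
C\*vSSR.
.\" For contrast, the classical AT&T ms convention (no .AM) puts the
.\" accent string BEFORE the letter, e.g. \*'e for an e with acute.
```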
Practically speaking, you'd never get away with typesetting Vietnamese
(despite its being, fundamentally, a Latin script), nor even the less
demanding Hanyu Pinyin, with ms's AM macro.  (Our preprocessor for
Pinyin input, gpinyin(1), struggles with a similar issue[2]--I have a
solution sketched but not fully tested.)

Something not mentioned in the above, but noted in Tuthill's Berkeley
ms document[3], is that the AT&T accent mark strings are placed
_before_ the character they modify, whereas .AM accent strings go
_after_ the base character.

> Perhaps I searched the wrong man page?  Neither groff(1), groff(7),
> groff_char(7) nor groff_font(5) mentions the .AM request; groff(7)
> in its subsection "Unicode Characters" states
>
> > The extended escape u allows the inclusion of all available
> > Unicode characters into a roff file.
>
> It then mentions that the Unicode input conventions work for the
> -Tutf8 device:
>
> > The availability of the Unicode characters depends on the font
> > used.  For text mode, the device -Tutf8 is quite complete; for
> > troff modes it might happen that some or many characters will not
> > be displayed.  Please check your fonts.
>
> but that's not what I need.  Astonishingly, .AM and the -Tutf8
> device seem incompatible, so it is
>
> PDF: request .AM
>
> XOR
>
> -Tutf8 (but don't request .AM)

Interesting.  Yes, this is basically the shape of my workaround hack
in the gpinyin(1) man page[4] (the same one I mentioned in my previous
mail to this list).  Right now, my understanding is that
conditionalizing the input on ".if t" or ".if n" is all that should be
necessary--you're either going to a terminal or you're not.

Our HTML output driver thinks it's a troff device, but I don't know
what it does--it _should_, I think, emit HTML character entities
&likethis;, but I don't know how it handles special character
composition sequences of either form (regular: \[a aa]; Unicode:
\[u0061_0301]).
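Here is a rough sketch of that ".if t"/".if n" conditionalization
(the string name "Ch" is my own invention; this assumes ms, and I
haven't tried it with the HTML driver, where all bets are off):

```roff
.\" Sketch: one source file, two device classes.
.\" On terminal (nroff) devices, use the real Unicode code point;
.\" on typesetter (troff) devices, fall back to Berkeley accents.
.if n .ds Ch \[u010C]
.if t \{\
.  AM
.  ds Ch C\*v
.\}
.PP
The abbreviation is \*[Ch]SSR.
```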
I regret to say I don't spend a lot of time looking at groff's HTML
output.

At 2021-05-17T15:47:02+0200, Oliver Corff wrote:
> Anyway, for my purpose .AM solves the problem.  Is it possible to
> include that in the man pages of the groff system?  I only found it
> online, as indicated in my original post.

I don't think we can recommend .AM to everybody in good conscience;
for one thing, it's limited to the ms macro package.  For another, its
underlying implementation is, as I said above, too crude for general
character composition in the post-ISO-8859 era[5].

For our ms documentation, this issue was already on my radar--I simply
haven't gotten around to jumping up and down on that section of ms.ms
and its counterparts in our Texinfo manual and groff_ms(7) yet.

More broadly, I reckon I should adapt my brain dump from this thread
into something coherent and clear for our groff(7) page.  That's a
challenge, though, because there's a whole lot I don't know about
digital font technology.  I don't even know how composition of
multiple characters is done in them.  I've read Unicode materials
about this stuff, but I don't know anything about real-world
implementations.  Maybe people on this list can help.

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2021-05/msg00045.html

[2] https://savannah.gnu.org/bugs/index.php?57524

[3] https://minnie.tuhs.org/cgi-bin/utree.pl?file=4.2BSD/usr/doc/msmacros/ms.diffs

[4] https://lists.gnu.org/archive/html/groff-commit/2021-05/msg00062.html

[5] Here are some gory details drawn straight from our tmac/s.tmac
file (the ms macro package).  Here's how we implemented
AT&T-compatible accent marks as documented in Lesk 1978.

.de acc*prefix-def
.ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'
..
.acc*prefix-def ' \'
.acc*prefix-def ` \`
.acc*prefix-def ^ ^
.acc*prefix-def , \(ac
.acc*prefix-def : \(ad
.acc*prefix-def ~ ~

And here's our implementation of 4.2BSD's .AM.
.de acc*over-def
.ds \\$1 \Z'\v'(u;\w'x'*0+\En[rst]-\En[.cht])'\
\h'(u;-\En[skw]+(-\En[.w]-\w'\\$2'/2)+\En[.csk])'\\$2'
..
.de acc*under-def
.ds \\$1 \Z'\v'\En[.cdp]u'\h'(u;-\En[.w]-\w'\\$2'/2)'\\$2'
..
.de acc*slash-def
.ds \\$1 \Z'\h'(u;-\En[.w]-\w'\\$2'/2)'\
\v'(u;\En[.cdp]-\En[.cht]+\En[rst]+\En[rsb]/2)'\\$2'
..
.\" improved accent marks
.de AM
.acc*over-def ' \'
.acc*over-def ` \`
.acc*over-def ^ ^
.acc*over-def ~ ~
.acc*over-def : \(ad
.acc*over-def v \(ah
.acc*over-def _ \(a-
.acc*over-def o \(ao
.acc*under-def , \(ac
.acc*under-def . \s[\En[.s]*8u/10u]\v'.2m'.\v'-.2m'\s0
.acc*under-def hook \(ho
.acc*slash-def / /
[...snip...]
..

The above is pretty dense stuff, but if you keep a copy of groff(7)
open in an adjacent window to decode the escape sequences, it is
decipherable.  You also have to know a groff extension to *roff
expression syntax:

  (c;e)   Evaluate e using c as the default scaling indicator.

For instance, in the above we see

  .ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

Here's how I break that line noise down.

".ds" means "define a string".  For instance, ".ds foo bar" defines a
string named "foo" which interpolates the input sequence "bar".  So
"\*[foo]" is replaced by "bar" in the input.  Here we're giving the ds
request a positional parameter as the string name.  We could thus
write a macro that wraps the ds request.

.de MAKESTRING
. ds \\$1 \\$2
. tm Look, ma!  I defined a string called \\$1 containing "\\$2"!
..

The backslashes are doubled because macro definitions are read in
"copy mode", which means that most input is stored rather than
immediately interpreted.  However, some of the most commonly used
escape sequences are interpreted immediately anyway, including
parameter, register, and string interpolations (\$, \n, \*), so you
have to "protect" them from interpretation by preceding them with an
extra backslash.  Or, in groff, you can use \E, as also seen above,
which is an "uninterpreted" escape character[6].
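Here is a small sketch (macro, string, and register names are all my
own) of how the doubled backslash and \E differ in practice, namely in
*when* a register ends up being read.  If I've got copy mode right,
s1 below freezes the value of x at the moment .demo is called, while
s2 reads x anew each time the string is interpolated:

```roff
.\" Sketch: \\n collapses to \n when the macro is defined, and is
.\" then interpolated (in copy mode) when .demo executes; \En is not
.\" interpreted in copy mode at all, so it survives into the string
.\" and is expanded only when \*[s2] is finally used.
.nr x 5
.de demo
.  ds s1 \\n[x]
.  ds s2 \En[x]
..
.demo
.nr x 7
s1 is \*[s1]; s2 is \*[s2]
```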
.ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

So this defines a string of the name given as the first parameter to
the macro.  Now let's tackle that rather terrifying right-hand side.

\Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

The next thing to know is that some escape sequences take a
user-specified delimiter character.  All of \Z, \h, and \w have this
property.  And the next thing to know after that is that groff keeps
track of the "input level", which we can think of as the "nesting
depth".  This is unlike the Unix shell or most lexical analyzers for
C, for example.  Armed with that knowledge, we can see that this is a
nested construct.

\Z'
    \h'
        (u; \w'x' - \w'\\$2' / 2)
    '
    \\$2
'

Broken into small pieces like that, it's easier to see what's going
on.  At the innermost level, we're evaluating an arithmetic expression
with a default scaling indicator of "u"--"basic units" in groff
parlance.  Here's the meaning of these escape sequences.

\Z'anything'  interprets the string "anything" without performing any
              of the default horizontal motions that ordinarily
              accompany the glyphs inside of it.  It's like \z from
              Unix troff, but applies to every input character within
              the delimiters.  So \za\zb\zc and \Z'abc' are
              equivalent.

\h'dist'      performs a horizontal motion of the amount "dist".

\w'anything'  measures the horizontal width of "anything" as if it
              were rendered in the current environment "normally";
              that is, it doesn't care about the \Z at an enclosing
              input level.

Example:

  $ nroff | cat -s
  \w'foobar'
  \w'\Z'foobar''
  <CTRL-D>
  144 0

Putting all this together, we can see that acc*prefix-def defines a
string named for its first parameter that does not advance the
character position, and which outputs its second parameter combined
with a horizontal motion of the width of an "x"[7] minus half the
width of the second parameter.
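As a self-contained illustration, the same trick can be replayed
outside ms (the names "my-acc-def" and "ac" are mine, not groff's or
ms's; a sketch, not tested on every device):

```roff
.\" Sketch: a stripped-down imitation of acc*prefix-def.  The string
.\" draws the accent inside \Z (so there is no net motion), shifted
.\" right by the width of an "x" minus half the accent's own width;
.\" the base letter then prints at the original position, under the
.\" accent.
.de my-acc-def
.  ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'
..
.\" define a prefix-style acute accent string named "ac"
.my-acc-def ac \'
An o with an acute accent: \*[ac]o
```

Like the original, this centers the mark correctly only over base
letters roughly the width of an "x".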
This arithmetic is applied no matter what \$2 is, so you can imagine
that there might be accent marks for which it works poorly, like
U+0315 COMBINING COMMA ABOVE RIGHT, but such accents were, I reckon,
little-known in the 1970s, and not attested in any character encoding
standard I'm aware of.

Corrections to any of the above are, as always, welcome.

[6] Analogous to the "uninterpreted leader" and "uninterpreted tab"
characters \a and \t.  Since they're configurable, you don't
necessarily know what they're going to be when you're defining a macro
or are otherwise in copy mode.  It took me ages to even begin to
understand this stuff, which is why I keep rewriting our
documentation.

[7] "in the current font" and "in the current environment", yadda
yadda