At 2021-05-15T23:39:31+0200, Oliver Corff wrote:
> Dear All,
>
> I try to use the correct abbreviation for the former Czechoslovak
> Socialist Republic, which is ČSSR (C + hacek, caron, wedge).  The
> first attempt (entering Unicode 0x010C directly, leaving everything
> to preconv(1)) did not work.  Then I consulted groff_char(7), but
> there is no predefined \[vC], only \[vS] etc. for base letters s, S,
> z, and Z.  No C!  I keep scratching my head.
>
> None of the other suggested notations (like \[u0043_030C]) work out
> of the box (see groff(7)).
As noted in my previous reply just a little while ago[1], I believe
this is because the font does not have coverage for one of the
constituent elements of this character composition sequence, so it
can't render the result accurately.

> The GNU groff online manual
> (https://www.gnu.org/software/groff/manual/groff.html) offers an
> escape route with the following request:
>
> .AM
>
> A Berkeley extension which enables extended accent marks.

This feature is specific to the ms macro package.  You have to be
using the ms macros to use it.  You could certainly crib the
implementation (respecting the GPL, of course ;-) ) for other sorts of
documents; I will discuss the limitations of such a course at some
length below.

> For months, I thought this was a font setup question, and
> experimented with numerous fonts, to no avail.  Only today I
> discovered the .AM request.  It is contained in the online manual
> but I could not locate it in any of the groff man pages (perhaps a
> version issue?  I currently run 1.22.4).

It is documented in the 1.22.4 version of groff_ms(7), but it's buried
pretty deeply and mentioned only in passing.

  DIFFERENCES FROM troff ms
  [...]
    Text Settings
    [...]
    Improved accent marks (as originally defined in Berkeley's ms
    version) are available by specifying the AM macro at the beginning
    of your document.  You can place an accent over most characters by
    specifying the string defining the accent directly after the
    character.  For example, n\*~ produces an n with a tilde over it.

I believe the reason AM, and the earlier AT&T ms accent mark placement
mechanism documented in Lesk's 1978 "Typing Documents on the UNIX
System" paper, aren't documented more prominently is that they were
showing their age even by the 1990s.  They are smarter than dumb
overstriking, but they aren't adaptable to the stacking of multiple
accent marks in the same general direction relative to the character.
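To make that concrete before going further, here is a minimal sketch
of what Oliver's ČSSR example might look like with the Berkeley
mechanism.  This assumes the ms macro package and a typesetter output
device; I haven't tested it against every driver:

```roff
.\" Sketch: Berkeley-style accent marks in an ms document.
.\" .AM must be called before any accented text; with .AM, the accent
.\" string FOLLOWS the base letter, and "v" is the caron (hacek).
.AM
.PP
The abbreviation for the former Czechoslovak Socialist Republic is
C\*vSSR.
.\" For contrast, the classical AT&T ms convention (no .AM) puts the
.\" accent string BEFORE the letter, e.g. \*'e for an e with acute.
```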
Practically speaking, you'd never get away with typesetting Vietnamese
(despite its being, fundamentally, a Latin script), nor even the less
demanding Hanyu Pinyin, with ms's AM macro.  (Our preprocessor for
Pinyin input, gpinyin(1), struggles with a similar issue[2]--I have a
solution sketched but not fully tested.)

Something not mentioned in the above, but noted in Tuthill's Berkeley
ms document[3], is that the AT&T accent mark strings are placed
_before_ the character they modify, whereas .AM accent strings go
_after_ the base character.

> Perhaps I searched the wrong man page?  Neither groff(1), groff(7),
> groff_char(7) nor groff_font(5) mentions the .AM request; groff(7)
> in its subsection "Unicode Characters" states
>
> > The extended escape u allows the inclusion of all available
> > Unicode characters into a roff file.
>
> It then mentions that the Unicode input conventions work for the
> -Tutf8 device:
>
> > The availability of the Unicode characters depends on the font
> > used.  For text mode, the device -Tutf8 is quite complete; for
> > troff modes it might happen that some or many characters will not
> > be displayed.  Please check your fonts.
>
> but that's not what I need.  Astonishingly, .AM and the -Tutf8
> device seem incompatible, so it is
>
> PDF: request .AM
>
> XOR
>
> -Tutf8 (but don't request .AM)

Interesting.  Yes, this is basically the shape of my workaround hack
in the gpinyin(1) man page[4] (the same one I mentioned in my previous
mail to this list).  Right now, my understanding is that
conditionalizing the input on ".if t" or ".if n" is all that should be
necessary--you're either going to a terminal or you're not.

Our HTML output driver thinks it's a troff device, but I don't know
what it does--it _should_, I think, emit HTML character entities
&likethis;, but I don't know how it handles special character
composition sequences of either form (regular: \[a aa]; Unicode:
\[u0061_0301]).
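Here is a rough sketch of that ".if t"/".if n" conditionalization
(the string name "Ch" is my own invention; this assumes ms, and I
haven't tried it with the HTML driver, where all bets are off):

```roff
.\" Sketch: one source file, two device classes.
.\" On terminal (nroff) devices, use the real Unicode code point;
.\" on typesetter (troff) devices, fall back to Berkeley accents.
.if n .ds Ch \[u010C]
.if t \{\
.  AM
.  ds Ch C\*v
.\}
.PP
The abbreviation is \*[Ch]SSR.
```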
I regret to say I don't spend a lot of time looking at groff's HTML
output.

At 2021-05-17T15:47:02+0200, Oliver Corff wrote:
> Anyway, for my purpose .AM solves the problem.  Is it possible to
> include that in the man pages of the groff system?  I only found it
> online, as indicated in my original post.

I don't think we can recommend .AM to everybody in good conscience;
for one thing, it's limited to the ms macro package.  For another, its
underlying implementation is, as I said above, too crude for general
character composition in the post-ISO-8859 era[5].

For our ms documentation, this issue was already on my radar--I simply
haven't gotten around to jumping up and down on that section of ms.ms
and its counterparts in our Texinfo manual and groff_ms(7) yet.

More broadly, I reckon I should adapt my brain dump from this thread
into something coherent and clear for our groff(7) page.  That's a
challenge, though, because there's a whole lot I don't know about
digital font technology.  I don't even know how composition of
multiple characters is done in them.  I've read Unicode materials
about this stuff, but I don't know anything about real-world
implementations.  Maybe people on this list can help.

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2021-05/msg00045.html

[2] https://savannah.gnu.org/bugs/index.php?57524

[3] https://minnie.tuhs.org/cgi-bin/utree.pl?file=4.2BSD/usr/doc/msmacros/ms.diffs

[4] https://lists.gnu.org/archive/html/groff-commit/2021-05/msg00062.html

[5] Here are some gory details drawn straight from our tmac/s.tmac
file (the ms macro package).  Here's how we implemented
AT&T-compatible accent marks as documented in Lesk 1978.

.de acc*prefix-def
.ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'
..
.acc*prefix-def ' \'
.acc*prefix-def ` \`
.acc*prefix-def ^ ^
.acc*prefix-def , \(ac
.acc*prefix-def : \(ad
.acc*prefix-def ~ ~

And here's our implementation of 4.2BSD's .AM.
.de acc*over-def
.ds \\$1 \Z'\v'(u;\w'x'*0+\En[rst]-\En[.cht])'\
\h'(u;-\En[skw]+(-\En[.w]-\w'\\$2'/2)+\En[.csk])'\\$2'
..
.de acc*under-def
.ds \\$1 \Z'\v'\En[.cdp]u'\h'(u;-\En[.w]-\w'\\$2'/2)'\\$2'
..
.de acc*slash-def
.ds \\$1 \Z'\h'(u;-\En[.w]-\w'\\$2'/2)'\
\v'(u;\En[.cdp]-\En[.cht]+\En[rst]+\En[rsb]/2)'\\$2'
..
.\" improved accent marks
.de AM
.acc*over-def ' \'
.acc*over-def ` \`
.acc*over-def ^ ^
.acc*over-def ~ ~
.acc*over-def : \(ad
.acc*over-def v \(ah
.acc*over-def _ \(a-
.acc*over-def o \(ao
.acc*under-def , \(ac
.acc*under-def . \s[\En[.s]*8u/10u]\v'.2m'.\v'-.2m'\s0
.acc*under-def hook \(ho
.acc*slash-def / /
[...snip...]
..

The above is pretty dense stuff, but if you keep a copy of groff(7)
open in an adjacent window to decode the escape sequences, it is
decipherable.  You also have to know a groff extension to *roff
expression syntax:

  (c;e)   Evaluate e using c as the default scaling indicator.

For instance, in the above we see

  .ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

Here's how I break that line noise down.

".ds" means "define a string".  For instance, ".ds foo bar" defines a
string named "foo" which interpolates the input sequence "bar".  So
"\*[foo]" is replaced by "bar" in the input.  Here we're giving the ds
request a positional parameter as the string name.  We could thus
write a macro that wraps the ds request.

.de MAKESTRING
. ds \\$1 \\$2
. tm Look, ma!  I defined a string called \\$1 containing "\\$2"!
..

The backslashes are doubled because macro definitions are read in
"copy mode", which means that most input is stored rather than
immediately interpreted.  However, some of the most commonly used
escape sequences are interpreted immediately anyway, including
parameter, register, and string interpolations (\$, \n, \*), so you
have to "protect" them from interpretation by preceding them with an
extra backslash.  Or, in groff, you can use \E, as also seen above,
which is an "uninterpreted" escape character[6].
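Here is a small sketch (macro, string, and register names are all my
own) of how the doubled backslash and \E differ in practice, namely in
*when* a register ends up being read.  If I've got copy mode right,
s1 below freezes the value of x at the moment .demo is called, while
s2 reads x anew each time the string is interpolated:

```roff
.\" Sketch: \\n collapses to \n when the macro is defined, and is
.\" then interpolated (in copy mode) when .demo executes; \En is not
.\" interpreted in copy mode at all, so it survives into the string
.\" and is expanded only when \*[s2] is finally used.
.nr x 5
.de demo
.  ds s1 \\n[x]
.  ds s2 \En[x]
..
.demo
.nr x 7
s1 is \*[s1]; s2 is \*[s2]
```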
.ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

So this defines a string of the name given as the first parameter to
the macro.  Now let's tackle that rather terrifying right-hand side.

\Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'

The next thing to know is that some escape sequences take a
user-specified delimiter character.  All of \Z, \h, and \w have this
property.  And the next thing to know after that is that groff keeps
track of the "input level", which we can think of as the "nesting
depth".  This is unlike the Unix shell or most lexical analyzers for
C, for example.  Armed with that knowledge, we can see that this is a
nested construct.

\Z'
    \h'
        (u; \w'x' - \w'\\$2' / 2)
    '
    \\$2
'

Broken into small pieces like that, it's easier to see what's going
on.  At the innermost level, we're evaluating an arithmetic expression
with a default scaling indicator of "u"--"basic units" in groff
parlance.  Here's the meaning of these escape sequences.

\Z'anything'  interprets the string "anything" without performing any
              of the default horizontal motions that ordinarily
              accompany the glyphs inside of it.  It's like \z from
              Unix troff, but applies to every input character within
              the delimiters.  So \za\zb\zc and \Z'abc' are
              equivalent.

\h'dist'      performs a horizontal motion of the amount "dist".

\w'anything'  measures the horizontal width of "anything" as if it
              were rendered in the current environment "normally";
              that is, it doesn't care about the \Z at an enclosing
              input level.

Example:

  $ nroff | cat -s
  \w'foobar'
  \w'\Z'foobar''
  <CTRL-D>
  144 0

Putting all this together, we can see that acc*prefix-def defines a
string named for its first parameter that does not advance the
character position, and which outputs its second parameter combined
with a horizontal motion of the width of an "x"[7] minus half the
width of the second parameter.
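As a self-contained illustration, the same trick can be replayed
outside ms (the names "my-acc-def" and "ac" are mine, not groff's or
ms's; a sketch, not tested on every device):

```roff
.\" Sketch: a stripped-down imitation of acc*prefix-def.  The string
.\" draws the accent inside \Z (so there is no net motion), shifted
.\" right by the width of an "x" minus half the accent's own width;
.\" the base letter then prints at the original position, under the
.\" accent.
.de my-acc-def
.  ds \\$1 \Z'\h'(u;\w'x'-\w'\\$2'/2)'\\$2'
..
.\" define a prefix-style acute accent string named "ac"
.my-acc-def ac \'
An o with an acute accent: \*[ac]o
```

Like the original, this centers the mark correctly only over base
letters roughly the width of an "x".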
This arithmetic is applied no matter what \$2 is, so you can imagine
that there might be accent marks for which it works poorly, like
U+0315 COMBINING COMMA ABOVE RIGHT, but such accents were, I reckon,
little-known in the 1970s, and not attested in any character encoding
standard I'm aware of.

Corrections to any of the above are, as always, welcome.

[6] Analogous to the "uninterpreted leader" and "uninterpreted tab"
characters \a and \t.  Since they're configurable, you don't
necessarily know what they're going to be when you're defining a macro
or are otherwise in copy mode.  It took me ages to even begin to
understand this stuff, which is why I keep rewriting our
documentation.

[7] "in the current font" and "in the current environment", yadda
yadda