Hi Deri,

At 2024-09-05T20:31:55+0100, Deri wrote:
> On Thursday, 5 September 2024 04:15:44 BST G. Branden Robinson wrote:
> > At 2024-09-04T15:05:38-0500, Dave Kemper wrote:
> > > On Wed, Sep 4, 2024 at 11:04 AM Deri <d...@chuzzlewit.myzen.co.uk>
> > > wrote:
> > > > The example using \[u012F] is superior (in my opinion) because
> > > > it is using a single glyph the font designer intended for that
> > > > character rather than combining two glyphs that don't marry up
> > > > too well.
> > > 
> > > I agree with this opinion.
> > 
> > Me too.  I can't deny that the pre-composed Ů looks much better than
> > the constructed one.
[...]
> That is the purpose of combining diacritics: to "invent" a glyph for
> a character which does not exist in a font. But if a glyph does exist
> for the composed character, it seems bizarre for troff to require
> that the composited glyph be used rather than the purposely defined
> glyph. The difference with pdf meta-data (which uses UTF-16) is that
> it is not restricted to whatever fonts are in the pdf; for rendering,
> it uses system fonts, so even though a particular character may be
> missing from the font in the pdf, it is extremely likely to be
> available within the system fonts.
>
> > It does happen, and as a typesetting application I think we _can_
> > expect people to try such things.  
> 
> Of course, and if a particular character combination has no unicode
> code point, then compositing in both the document and meta-data is the
> only option, but if a code point for the character combination does
> exist then this should be passed to the device driver because that
> gives the system fonts a chance to use the custom glyph.
> 
> > Ugly rendering is better than no rendering at all, and it's not our
> > job to make Okular render complex characters prettily in its
> > navigation pane.
> 
> But this is a retrograde step: passing glyphs to device drivers in
> NCD [recte: NFD] form did not occur when text was passed to device
> drivers in copy-in mode, so we are forcing Okular (and all other pdf
> viewers I tried) to render the given text in a sub-standard way -
> which we did not before.

...hence my very next sentence.

> > That said, we _can_ throw it a bone, and it seems easy enough to do
> > so.

A more important reason to note the foregoing is your identification of
copy mode behavior as a feature that output device macro package authors
(like yourself with "pdf.tmac") will demand.  It's a point that has
consistently caused me concern, and I dislike the idea of adding a "mode
3" as you put it, but have felt steered down that path by other
considerations.  Fortunately, an alternative is starting to form in my
mind.

The theme of that alternative is an addendum I would make to your
statement: copy mode is desirable for passing text to device drivers...
*for _some_ applications*.

> CMap applies to the fonts used in the document text, not meta-data
> rendered in the system fonts.

That's lame.  It puts paid to the notion of reusing text from the
document as metadata, at least for anything outside the Basic
Multilingual Plane--which, like 640kB of RAM, was supposed to be all
anyone would ever need.  But it's not your or my problem to solve.
"It's searchable, Jim, but not as we know it."

> I thought this simply meant that if you input to groff a composite
> character it must be maximally decomposed. The para before deals with
> input of non-composite unicode. So this is saying that both groff
> representations of unicode code points (uXXXX[X[X]], and ‘u’
> component1 ‘_’ component2 ‘_’ component3 ...) are acceptable as
> input.

That's one plausible reading.

> I don't understand why you consider that this wording means you have
> to output NCD [recte: NFD] to output drivers, i.e. \[uXXXX_XXXX]
> rather than the \[uXXXX] originally provided by preconv.

Because `\X` _doesn't_ read its argument in copy mode, and there is no
general mechanism in the formatter for reading the argument of _any_
delimited escape sequence in copy mode.

In fact, only two escape sequences read their arguments in copy mode:

1.  \!, which reads to the end of the line.

2.  \?, which, uniquely, reads to another `\?`.  But GNU troff prevents
    its contents from going to device-independent output, so it plays
    little or no role in the present discussion.
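
To make the copy mode behavior of `\!` concrete, here is the classic
idiom (the PostScript payload is merely a placeholder):

  This sentence is formatted normally.
  \!x X ps: exec 0.5 setgray
  So is this one.

Everything after the `\!` is read in copy mode and dropped into grout
verbatim as "x X ps: exec 0.5 setgray".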

Not wanting to write dedicated logic for a "delimited escape sequence
copy mode reader", I've been trying to make `\X` work better for
arguments to device extension commands while remaining in interpretation
mode.

And seemingly every step I'm taking down that path is causing you
greater alarm, at least when I pitch the idea of having pdf.tmac
actually _use_ `\X` more than the minimal amount it does now.

$ git grep '\\X' tmac/pdf.tmac
tmac/pdf.tmac:.char \[lh] \X'pdf: xrev'\[rh]\X'pdf: xrev'

That's it.

So, to get back to your point, why do I insist that I have to output
NFD?

I don't insist.  \X forces that upon me, because it reads its arguments
in interpretation mode.  That means when `token::next()` reads
`\[u012F]` in the input, that escape sequence is run through a mill of
processing, one step of which _normalizes it_.  By the time the
corresponding `charinfo` for the special character is created, it has
become `\[u0069_0328]`.  That is not a reversible transformation.  If
you feel I'm trying to cram `\[u0069_0328]` down gropdf's throat, it is
because that is all I can give you--it is all `\X` will let me have.
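
A minimal reproduction, with the changes under discussion in place
(the "pdf: test" payload is a stand-in; grout abridged):

  $ echo "\X'pdf: test \[u012F]'" | groff -T pdf -Z | grep '^x X'
  x X pdf: test \[u0069_0328]

The \[u012F] I typed is gone by the time the device extension command
is assembled.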

As I believe I have mentioned in one of our many recent exchanges, I
briefly contemplated adding a bit of global state to the formatter, so
that `\X` could shut off this particular behavior.  That felt wrong for
a few reasons.  First, it seemed inelegant.  Second, it means I might
start populating `charinfo` objects with stuff that other parts of the
formatter weren't prepared to handle.  In principle, such `charinfo`
objects wouldn't be living very long: they'd be inside a `\X` escape
sequence, the contents of which don't have a path back to the formatter.

Unless maybe one uses `\X` in a diversion.  Then you'll get that oddball
`charinfo` back when you interpolate it.  I see storm clouds there.
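
Concretely, the pattern I'm leery of looks like this (the payload is
invented):

  .di XD
  \X'pdf: whatever'
  .di
  .XD \" interpolating the diversion replays the stored node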

> I see. Of course any file can be considered as _one_ stream, a
> sequence of bytes, but I don't think that helps us much!
> 
> Perhaps I should elucidate. The grout contents can either affect the
> contents of the page or the "container" (this is meta-data). For
> grodvi \X'papersize=...' affects the window size when displayed in a
> dvi viewer; it has no effect on the contents. The PDFMARK extensions
> for grops affect meta-data of the pdf when distilled. There is no way
> of telling from grout whether a particular "x X " command affects the
> document rendition or the container, a shame.

Thanks for walking through this, because as of your final sentence you
have me very badly wanting to inline the widely seen meme of Jack
Nicholson nodding and saying "YES" emphatically.

You trimmed out a giant chunk of my response, which is fine because it
was as rich in irritation as in technical content, but with that
final observation you have landed squarely on one of the points that is
driving me crazy about the status quo.

'There is no way of telling from grout whether a particular "x X "
command affects the document rendition or the container, a shame.'

That may end up as the epigram on a future commit.
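
To spell the problem out for onlookers: these two hypothetical grout
lines are syntactically indistinguishable, yet the first (grodvi's
paper size hint) touches only the container, while the second (a
PostScript fragment for grops) marks the page.

  x X papersize=595p,842p
  x X ps: exec 0.5 setgray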

You seem baffled by why I'm changing things that "aren't broke".  That's
why.

> The 't' and 'u' commands are identical to 'C' followed by 'h'; they
> are strings of single-character glyph names to be found in the
> current font, not text. So "tHello" is the equivalent of:-
> 
> CH
> h (width of H)
> Ce
> h (width of e)
> etc..
> 
> 'u' is similar except the 'h' commands take into account the value
> given to 'u'.

Yes.  I hope some people on the list appreciate this primer.
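
For anyone who wants to see it in the raw, here is the expansion
written out by hand (the 'h' arguments are plausible inventions; real
ones are in machine units and depend on the font, size, and device
resolution):

  tHello

is shorthand for

  CH
  h7220
  Ce
  h4440
  Cl
  h2780
  Cl
  h2780
  Co
  h5000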

> > Here I simply must push back.  This depends entirely on what the
> > device extension does, and we have _no control_ over that.
> 
> Are you using "we" here to mean "groff developers"?

"We" being "authors/maintainers of GNU troff, the formatter".  The
formatter is a Unix filter; once grout is emitted, there is no return.

> Can't be true, since we can change the device drivers, and I can
> particularly change gropdf. Were you using "we" in a more regal mode?

No, but because of your particularity regarding gropdf, it's more
necessary than it would otherwise be to articulate the interface between
the formatter and the output driver.  From a systems design perspective,
that's valuable and even preferable, because a major point of modular
design is elimination of unnecessary couplings.

But it does mean that you and I get to have discussions/arguments/chats
about what exactly that interface _is_, what can and cannot be assumed,
and how GNU troff's output should behave and be structured.

> Isn't it just good practice to reset to a known state after making an
> external call? In assembler you push the registers on the stack
> before making a call to a third-party subroutine, and restore them
> afterwards.

It can be, if restoring that known state doesn't have side effects.
Unfortunately we run smack into Savannah #64484 at this point.[1]

But thanks for the analogy to assembly language; it may be helpful to
readers if I apply it.

Here's one of the things from my previous reply that had me jumping up
and down with frustration.  It's from node.cpp in GNU troff.

void troff_output_file::start_special(tfont *tf, color *gcol,
                                      color *fcol,
                                      bool omit_command_prefix)
{
  // Pessimistically refresh every piece of output driver state that
  // an earlier device extension command might have clobbered: font,
  // stroke color, fill color.
  set_font(tf);
  stroke_color(gcol);
  fill_color(fcol);
  // Complete any pending 't'/'u' command and re-establish the drawing
  // position so that an 'x' command can validly follow.
  flush_tbuf();
  do_motion();
  if (!omit_command_prefix)
    put("x X ");
}

Every object of type "special_node", meaning things constructed by `\X`
escape sequences and `device` requests, causes the foregoing to be
called when it's written out to grout.  Every single one.

No problem, right?

Well, again, like I said in my previous reply, a device extension
command _might_ do anything.  It's like a function pointer (or an
indirect call in assembly).  Invoking it might cause _all_ registers to
be clobbered, or none, or just some.  In ordinary ABIs, we document
_which_ registers are "dirtied" or clobbered by subroutine calls.  This
has been a fixture of machine language library documentation going back
at least to the 1970s.  (I remember, because this stuff showed up in the
programming manuals of 8-bit micros.  I was the butt of the old Unix
hacker joke--as a kid, I didn't _have_ a nickel.  I _couldn't_ buy a
better computer.)

You may see the problem.  Not having any idea what elements of output
device state the device extension command clobbered/dirtied, GNU troff
just marks _everything_ dirty.

(Again, as I noted before, `flush_tbuf()` is distinguishable from the
other four calls here.  The "tbuf" is not a piece of state that any
output driver cares about.  But it needs to be flushed, a.k.a. any
pending 't' or 'u' command completed, before an 'x' command can follow.)
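
To picture it in grout (payload invented, motions elided): a `\X`
planted mid-word must yield something like

  tHel
  x X pdf: something
  tlo

and never an 'x' command barging into an unterminated 't'.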

This, I think, is one of two reasons advanced groff hackers resorted to
`\!` black magic to construct device extension commands by hand.  With
"transparent throughput", you can hide all the salami you want.  GNU
troff will go merrily on its way assuming that you have dirtied no
"registers", or output device state, and consequently not inject any
commands fixing up the colors, font, or drawing position.
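
Compare the grout in the two cases (abridged and partly conjectural;
the exact fix-up commands depend on accumulated state).  From
\X'pdf: foo', the command arrives escorted by state fix-ups:

  f5
  md
  DFd
  V12000
  H72000
  x X pdf: foo

From \!x X pdf: foo, it arrives alone:

  x X pdf: foo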

Another reason _may_ be that, as noted above, `\X` doesn't read its
arguments in copy mode, so if you interpolated strings in it, they might
contain special characters and could not be meaningfully represented in
device-independent output.  However, `device` _does_ read its arguments
in copy mode.  On the other hand, because it constructs a
"special_node" (absolutely not the same thing as a special character
node,[2]  This is troff--we tell unlike things apart by giving them
similar names. 😠), the same "assume everything is dirty" logic from the
previous paragraph is run, and that, too, could cause undesired results.

So I think people got into the habit of throwing `\!` at a lot of
problems.  I will be surprised, and eat my hat, if macro authors
selected the least esoteric solution for every situation.

> > > \[u0069_0328] is a named glyph in a font, \[u012F] is a 7bit ascii
> > > representation (provided by preconv) of the unicode code point.
> > 
> > Okay, couple of things: 0x12F does not fit in 7 bits, nor even in 8.
> 
> But the text "\[u012F]" does.

Okay, yes.  Copy mode wins here.  It's interpretation mode's insistence
on pre-chewing its special character food that leads to the appearance
of the objectionable \[u0069_0328] in the output.

> What happens when you use \[fi] in text. TR font says:-
> 
> fi      556,683 2       140     fi      --      FB01
> 
> So the postscript name is also "fi", and the postscript-to-unicode
> mapping says this is codepoint 0xFB01; nowhere does it ever become
> u0066_0069 within the entire code path.
> 
> 
> Even if you ask troff to composite it, as shown below
> (\[u0066_0069]), troff sensibly changes this to Cfi in grout, so we
> are back where we started.
> 
> > groff_char(7) again:
> > 
> >    Ligatures and digraphs
> >        Output   Input   Unicode           Notes
> >        ──────────────────────────────────────────────────────────────────
> >        ff       \[ff]   u0066_0066        ff ligature +
> >        fi       \[fi]   u0066_0069        fi ligature +
> >        fl       \[fl]   u0066_006C        fl ligature +
> >        ffi      \[Fi]   u0066_0066_0069   ffi ligature +
> >        ffl      \[Fl]   u0066_0066_006C   ffl ligature +
> >        Æ        \[AE]   u00C6             AE ligature
> >        æ        \[ae]   u00E6             ae ligature
> >        Π       \[OE]   u0152             OE ligature
> >        œ        \[oe]   u0153             oe ligature
> >        IJ        \[IJ]   u0132             IJ digraph
> >        ij        \[ij]   u0133             ij digraph

Okay.  What would you like me to change to address your point?

> > I don't know about "simply", but yes.
> 
> Well I understood it, so must be simple.

I seem often to struggle with things you find simple.  ;-)

> > As in two of the precise situations that lifted the lid on this
> > infernal cauldron: the annotation and rendering _outside of a
> > document's text_ of section headings, and document metadata naming
> > authors, who might foolishly choose to be born to parents that don't
> > feel bound by the ASCII character set, and as such can appear
> > spattered with diacritics in an "info" dialog.
> 
> mon_premier_doc.mom (in mom/examples) was authored by one such
> unfortunate:-
> 
> .AUTHOR "Cicéron"
> 
> The "info" dialog looks fine to me, what do you see wrong?

Nothing.  I was attempting humor.

> > If GNU troff is to have any influence over how such things appear,
> > we must consider the problem of how to express text there, and
> > preferably do so in ways that aren't painful for document authors
> > to use.
> > 
> > > If a user actually wants to use a composite character this is
> > > saying you can enter \[u0069_0328] or you can leave it to preconv
> > > to use \[u012F]. Unfortunately the way you intend to change
> > > groff, document text will always use the single glyph (if
> > > available)
[this is where a paragraph break and response were inserted by me]
> > > and meta-data will always use a composite glyph.
> > 
> > Eh what?  Where is this implied by anything I've committed or
> > proposed?  (It may not end up mattering given the point I'm
> > conceding.)
> 
> Of course if you split a sentence, you can make it look stupid and
> exclaim "Eh what?" even though the change you have made to groff
> affects the second half of the sentence - the delivery of text to
> output drivers. Nice trick!

But at this point in my response, you had already seen me say, three
times or more, that I don't think the change is sustainable.  In _this_
I identified where it came from: logic in `token::next()` that
normalizes Unicode special character escape sequences before
constructing `charinfo` objects from them.

> > Strictly, it will always use whatever I get back from certain
> > "libgroff" functions.  But I'm willing to flex on that.  Your
> > "Se ocksŮ" example is persuasive.
> 
> Yes, but it takes so much spazzy effort to finally persuade you, I
> wish the penny dropped a bit quicker.

I wonder if pellucid documentation would help me any.

Nah, why would anyone waste effort on THAT?  Real hackers go straight to
the source, every time.  Specs are for suits.  UTSL!

> > Though some irritated Swede is bound to knock us about like tenpins
> > if we keep deliberately misspelling "också" like that.
> 
> I profusely apologise, it was entirely for demonstration.

Another joke of mine.  I was imagining Dolph Lundgren.

> > Okay, how about a more pass-through approach when it comes to byte
> > sequences of the form `\[uxxxx]` (where 'xxxx' is 4 to 6 uppercase
> > hexadecimal digits)?
> 
> Yes please,

I'm working on it.  And cutting off another sentence in the middle.
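
The shape I have in mind, roughly (proposed behavior, not what the
code does today; the payload is invented):

  input:  .device pdf: bookmark Ing\[u012F]
  grout:  x X pdf: bookmark Ing\[u012F]

The bracketed form would survive untouched, leaving the driver free to
map u012F to a precomposed glyph, or to UTF-16, as it sees fit.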

> > Setting aside the term "user-facing programs", which you and I might
> > define differently, I find the above argument sound.  (Well, I'm a
> > _little_ puzzled by how precomposed characters are so valuable for
> > searching bookmarks since the PDF standard already had the CMap
> > facility lying right there.)
> 
> I've explained that: just like any application, all application text
> (including the bookmarks panel, info dialog, menus, etc.) is handled
> by the desktop windowing system (GTK, QT ...); only the canvas upon
> which pages are rendered has access to the CMap.

But you still need a spec for how any application text determined by
the content of the document is to be represented.  For PDFs, it's
UTF-16BE, is that right?
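
For the archive: taking the "Cicéron" example from earlier, my
understanding is that such a string is big-endian UTF-16 with a
leading byte-order mark, so the Info dictionary entry would read, in
hex-string form:

  /Author <FEFF00430069006300E90072006F006E>

Corrections welcome if I've misread the spec.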

> [Snipped a lot here - did not seem to have much to do whether it was
> sensible to use NFD communicating with device drivers - most of it was
> thinking out loud probably]

Yes, because there are multiple problems I'm trying to solve and at
least some are entangled with one another.

Moreover, the subject line of the thread remains: '"transparent" output
and throughput, demystified', a somewhat broad area.  Updating the
subject to a narrower topic is one good way to suggest that you'd like
to reduce the scope of discussion.

Not that I mind--if even one person reading this has thought to
themselves, "oh, it's not just me--some of this stuff looks fiendishly
complicated and under-documented to other people, too", then my efforts
are rewarded.

More so once I can get better documentation actually written.

> > As noted above, _some_ of this seems to me like a deficiency in PDF,
> > either the standard or the tools.  But, if the aforementioned
> > abandonment makes the problem less vexing, cool.
> 
> See explanation above.

Yup, asked and answered.  Thanks!

> > What's NCD?  Do you mean NFD?  The former usage persists through the
> > remainder of your email.
> 
> Of course, I was starting to flag.

So was I!

> > > I used round-tripping in the general sense that, after
> > > processing, you end up back where you started (the same as you
> > > used it). Why does groff have to be involved for something to be
> > > considered a round-trip?
> > 
> > I guess we were thinking about the problem in different ways.  I am
> > pretty deeply concerned about input to and output from the GNU troff
> > program specifically in this discussion.
> 
> Wood/Trees.

In another life I'd write a treatise on the habits of a species of wood
louse to be found only in a 15 square kilometer area in central
Derbyshire.

> > > Ok, if it can't be done, just leave what you have changed in \X,
> > > but leave .device and .output (plus friends) to the current
> > > copy-in mode, which seem to be working fine as they are now,
> 
> At some point you did tell me it had to be NFD because you were using
> the routine which generated the glyph names (which are NFD).

Yes.  I've tried to be more specific above.

> > Here are the coupled pairs as I conceive them.  Which two did you
> > have in mind?  If I'm overlooking something, you'd be doing me a
> > favor in telling me.[6]
> 
> Don't say I never do you a favour. :-)

I won't.  :)

> \Y
> .devicem

Yes.  Thanks!  I was aware but had put a big censorship bar over them in
my brain because I wanted to tackle the other problems first, and
neglected to rip the tape off when attempting to present the subject.

Another potentially thorny area, as macros are definitely read in copy
mode but just as definitely interpolated in interpretation mode.  Yet
another reason that we will continue to have diagnostics about odd
things being sent to device-independent output (formerly "can't write
hocus-cadabra 233 to transparent abraca-pocus--what the hell is the
matter with you?").

I'll have to be thinking about them soon.  But they're another good
example of coupling/complementarity.

> If you intend to include these in the changes you have made to \X,
> you had better talk to Tadziu, who often uses these (i.e. yesterday):
> the problem described in https://savannah.gnu.org/bugs/?66165#comment0
> would stop the postscript snippet posted yesterday from working,
> since you are changing "-" on input to \[u2010], so text such as "-1"
> becomes "\[u2010]1", which a postscript interpreter will not
> understand.

Yes, I'll be considering that as automated test fodder.  I hope it does
not surprise you to learn that I prefer to find problems myself rather
than by alarming you.
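
The failure mode, sketched (the macro body is an invented stand-in for
Tadziu's approach):

  .de PSHACK
  ps: exec -1 0 translate
  ..
  \Y[PSHACK]

If the hyphen in the macro body is normalized on input, grout ends up
carrying

  x X ps: exec \[u2010]1 0 translate

and the PostScript interpreter has no idea what to make of \[u2010].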

> > > Not if you restrict the changes to \X only, and document the
> > > difference in behaviour from the other 7 methods.
> > 
> > That's the status quo, but for the reasons I think I have thoroughly
> > aired above, I think it's a bad one.  Authors of interfaces to
> > device features that _you'd think_ would suggest the use of the
> > "device-related" escape sequence and request have avoided them to
> > date because of the undesirable side effects.
> > 
> > "Yeah, we have >this< for that, but nobody uses it.  Instead we just
> > go straight to page description assembly language."
> > 
> > Is no one ashamed of this?
> 
> I might be ashamed if I understood what you were talking about, and
> knew who you were quoting ("Yeah, ...").

Certain brogrammers of my acquaintance, not people from this list.  I'm
a little less hot under the collar about `\!` and `output` usage today,
but I want the circumstances of their usage in construction of device
extension commands to be clearly circumscribed and documented.  I'm
beginning to perceive what those boundaries might be.

> As I read this email I thought "The pith of this email is that
> Branden agrees that passing groff unicode characters (\[uXXXX]) in
> NFD format is sub-optimal, and he will revisit the code to rectify
> it."

If your only concern is getting me to revert commits, yes.

But I am also concerned with airing the motivation for changes that you
consider moronic, or at least ill-premised, in the first place.  If that
leads to someone piping up with a proper explanation (or even a proposed
solution!) to the problem that vexes me, then we _all_ win.

And if nobody does, because what I am working on is poorly understood
and regarded as untouchable black magic by most, then I'm glad to shine
a light on it.  That, too, is a bug that needs fixing.

> It could have been almost as short as Dave's but it wasn't. :-(

Them's the wages of technical debt.  I'm almost as sorry as you are to
have to pay for it.

But only almost.  I won't let it fester if I think myself able to do
something constructive about it.

> See you in bug #66155.

Among others, most likely.  ;-)

Regards,
Branden

[1] https://savannah.gnu.org/bugs/?64484

[2] There's no such thing--that would be a "glyph_node", duh.  Isn't
    everybody _born_ knowing this?  I simply cannot write about this
    stuff without getting angry at people's refusal to document things.

P.S.  The "static member function" idea in my earlier reply is a no-go--
      it erupted in my face for the exact reason I thought it might.
      But that's good--it means I'm getting better at predicting when
      C++ will betray me.
