On Saturday, 20 January 2024 00:56:34 GMT G. Branden Robinson wrote:
> Hi Deri,
> 
> At 2024-01-20T00:07:21+0000, Deri wrote:
> > On Friday, 19 January 2024 21:39:57 GMT G. Branden Robinson wrote:
> > > Right.  Before I craft a lengthy response to this--did you see the
> > > footnote?
> > 
> > Yes, sorry, it didn't help. I'm just comparing output now with output
> > in 1.23.0 and what you claim you are doing is the reverse of what I'm
> > seeing.
> 
> I haven't yet pushed anything implementing my (new) intentions,
> reflected in the subject line.  I wanted to gather feedback first.
> 
> What happened was, I thought "the `device` request and `\X` escape
> sequence should behave the same, modulo the usual differences in parsing
> (delimitation vs. reading the rest of the line, the leading double quote
> mechanism in request form, and so forth)".
> 
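For anyone skimming, the two forms being compared look roughly like this
(the "bogus" tag is made up, purely for illustration):-

\X'bogus: hello world'
.device bogus: hello world

Both are meant to end up as the same "x X bogus: hello world" device
control command in the grout, the differences being only in how the
argument is gathered from the input.
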
> Historically, that has never been the case in groff.
> 
> Here's (the meat of) the actual test case I recently wrote and pushed.
> 
> input='.nf
> \X#bogus1: esc \%man-beast\[u1F63C]\\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]#
> .device bogus1: req \%man-beast\[u1F63C]\\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]
> .ec @
> @X#bogus2: esc @%man-beast@[u1F63C]@@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]#
> .device bogus2: req @%man-beast@[u1F63C]@@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]'
> 
> I know that looks hairy as hell.  I'm testing several things.
> 
> Here is what the output of that test looks like on groff 1.22.3 and
> 1.22.4.
> 
> x X bogus1: esc man-beast\[u1F00] -
> x X bogus1: req @%man-beast\[u1F63C]\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]
> x X bogus2: esc man-beast@[u1F00] -
> x X bogus2: req @%man-beast@[u1F63C]@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]
> 
> Observations of the above:
> 
> A.  When using `\X`, the escape sequences \%, \[u1F63C], \[aq], \[dq],
>     \[ga], \[ha], \[rs], \[ti] all get discarded.
> 
> B.  When you change the escape character and self-quote it in the
>     formatter, it comes out as-is in the device control command.  I
>     found this absurd, since there is no such thing as an escape
>     character in the device-independent output language, and whatever
>     escaping convention a device-specific control command needs to come
>     up with for things like, oh, expressing Unicode code points is
>     necessarily independent of a random *roff document's choice of
>     escape character anyway.
> 
> Here is what the test output looks like on groff 1.23.0.  It enabled a
> few more characters to get rendered in PDF bookmarks.
> 
> x X bogus1: esc man-beast\[u1F00] -'"`^\~
> x X bogus1: req @%man-beast\[u1F63C]\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]
> x X bogus2: esc man-beast@[u1F00] -'"`^\~
> x X bogus2: req @%man-beast@[u1F63C]@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]
> 
> Here is what the test output looks like on groff Git HEAD.  It was my
> first stab at solving the problem, the one I am now having partial
> second thoughts about.
> 
> x X bogus1: esc man-beast\[u1F00] -'"`^\~
> x X bogus1: req man-beast\[u1F00] -'"`^\~
> x X bogus2: esc man-beast\[u1F00] -'"`^\~
> x X bogus2: req man-beast\[u1F00] -'"`^\~
> 
> I was briefly happy with this, but I started wondering what happens when
> you interpolate any crazy old damned string inside a device control
> command and I rapidly became uncomfortable.  Because `\X` does not read
> its argument in copy mode, it can get exposed to "nodes" (and in groff
> Git, `device` can too)--this is that old incomprehensible nemesis that
> afflicted pdfmom users relentlessly before 1.23.0.[1][2][3][4][5][6]
> 
>       can't transparently output node at top level
> 
> But the reason 1.23.0 doesn't throw these errors is because I hid them,
> not because we fixed them.[7]
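
A minimal way to see how a node sneaks into a device control command,
purely as an illustration:-

\X'bogus: caf\['e] society'

By the time \X collects its argument the \['e] has already become a
special-character node, so it reaches the device control command as a node
and is lost from the output, with the diagnostic above emitted or hidden
depending on the groff version.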

Hi Branden,

It might be worth clarifying what caused this error to appear (before you 
suppressed it in 1.23.0). A particularly "fruity" bookmark appears in the mom 
example file mom-pdf.mom. It uses:-

.HEADING 1 \
"Comparison of \-Tps\*[FU4]/\*[FU2]\-mpdfmark with \-Tpdf\*[FU4]/\*[FU2]\-mom

Which after expansion becomes this:-

7. Comparison of \-Tps\h'(\En[.ps]u/\E*[$KERN_UNIT]u*4u)'/\h'(\En[.ps]u/
\E*[$KERN_UNIT]u*2u)'\-mpdfmark with \-Tpdf\h'(\En[.ps]u/
\E*[$KERN_UNIT]u*4u)'/\h'(\En[.ps]u/\E*[$KERN_UNIT]u*2u)'\-mom

And this is passed to .pdfbookmark! In the version of pdf.tmac used until now, 
this monstrous string is run through .asciify to produce:-

7. Comparison of Tps/mpdfmark with Tpdf/mom

You can see that all the "\-" are missing: .asciify left them as nodes, and 
each of them would elicit the error. So under 1.22.4 this is what the overview 
bookmark in the pdf looked like:-

96 0 obj
<<
/Dest /pdf:bm23
/Parent 93 0 R
/Title (7. Comparison of Tps/mpdfmark with Tpdf/mom)
/Prev 109 0 R 
>>
endobj
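
In outline, the old scheme amounted to something like this (a simplified
sketch with invented names, not the literal pdf.tmac code):-

.\" Capture the (already expanded) heading text in a diversion.
.di bm*div
Heading text containing nodes such as \- and \[u00E9]
.br
.di
.\" Turn formatted nodes back into input characters where possible; as
.\" described above, nodes like \- and \[u00E9] cannot be usefully
.\" asciified.
.asciify bm*div
.\" Interpolate the result into a device control command; anything still
.\" a node at this point elicits the error and is lost from the title.
.device ps:exec [/Dest /pdf:bm23 /Title (\*[bm*div]) /Level 2 /OUT pdfmark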

Obviously, using .asciify is not the answer, particularly since each unicode 
character (\[uXXXX]) is a node which can't be asciified and so gets dropped. 
So in the latest version of pdf.tmac, not yet incorporated by Branden, the use 
of .asciify has been dropped and the complete, raw string is passed to the 
output driver, so it becomes gropdf's job to make sense of the bookmark. The 
grout output looks like:-

x X ps:exec [/Dest /pdf:bm23 /Title (7. Comparison of \-Tps\h'(\En[.ps]u/
\E*[$KERN_UNIT]u*4u)'/\h'(\En[.ps]u/\E*[$KERN_UNIT]u*2u)'\-mpdfmark with \-
Tpdf\h'(\En[.ps]u/\E*[$KERN_UNIT]u*4u)'/\h'(\En[.ps]u/\E*[$KERN_UNIT]u*2u)'\-
mom) /Level 2 /OUT pdfmark

But when gropdf writes the pdf it contains:-

96 0 obj << /Dest /pdf:bm23
/Parent 75 0 R 
/Prev 91 0 R 
/Title (7. Comparison of -Tps/-mpdfmark with -Tpdf/-mom)
>>
endobj

As you can see, this is a more accurate rendition of what the bookmark should 
be. The new pdf.tmac, together with the now released gropdf, successfully 
handles all unicode (\[uXXXX]), groff named glyphs (e.g. \[em] or \(em), and 
even \N'233' type characters, when they are passed to the output driver. This 
means that passing unicode in device controls is not an issue at all; there is 
no need to invent a new way, just use the well established convention of 
\[uXXXX] for unicode characters, which preconv provides.
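
So, for example, a bookmark along these lines (an illustrative line, not
taken from mom) now comes through intact, preconv having already converted
any raw UTF-8 in the source to the \[uXXXX] form:-

.pdfbookmark 1 "Caf\[u00E9] society \[em] d\[u00E9]j\[u00E0] vu \N'233'"

gropdf maps each of these notations onto the corresponding character in the
bookmark title, so the document author does not need to care which form
reached the output driver.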

Cheers 

Deri

> An aim of this proposal is to truly fix them.
> 
> I hope it will surprise no one to learn that I have recently also
> updated our documentation regarding tokens, nodes, how these relate to
> GNU troff's input processing, and related matters.
> 
> > I hope I don't elicit a too lengthy response.
> 
> I know such hope oft seems forlorn when talking to me...
> 
> > There are 3 logical possibilities for the list to decide:-
> > 
> > 1) .device behaves like \X.
> > 
> > This seems to be what Branden has done at the moment. Disadvantage is
> > that as a by-product you can't send unicode to the output drivers
> > using either method,
> 
> I'm not happy with this status quo, but this doesn't exactly mean you
> "can't send Unicode to output drivers".  What you have to do is _decide
> upon an encoding mechanism for them_.  That will be true no matter which
> way we solve this.  But I think it's best if there is _one_ way (per
> output driver, anyway), not two different ones depending on whether your
> encoded Unicode sequence is passed via `device` or `\X`.  This stuff is
> challenging enough to the user that that seems like gratuitous cruelty.
> 
> Unfortunately that _has been_ the status quo.
> 
> > and some escapes affect the text stream, when the expectation is that
> > things sent to the output driver should not affect the text stream.
> 
> Right.  That is what alarmed me about reading `device` and `\X`
> arguments in interpretation mode.
> 
> > 2) \X behaves like .device.
> > 
> > This is what Branden said was the intention. This allows pdf title
> > (normally shown in the window header in a pdf viewer) to use unicode.
> 
> This might be more accurately stated as:
> 
> 2) \X behaves like .device used to (in groff 1.23.0 and earlier).
> 
> And I repeat: this is a _hard_ prerequisite to expressing Unicode
> sequences in the output, but it seems like a useful one, so that authors of
> output drivers (and supporting macro files for them) can keep their
> sanity.
> 
> But making this happen means changing a CSTR #54 (1992) feature, not a
> GNU extension, so I felt I didn't have any wiggle room, and that the
> issue was best mooted on the list.
> 
> > 3) Leave things as they were prior to recent commits.
> 
> I'll be interested to see the argument from anyone who wants to defend
> the groff 1.22.{3,4} test case exhibits above.
> 
> > It will be interesting to hear from as many people as possible which
> > they think is the best option. I definitely think we should not be
> > making the use of unicode harder.
> 
> Strongly agreed.
> 
> Regards,
> Branden
> 



