Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Peter Schaffter
On Wed, Feb 07, 2024, Deri wrote:
> It is with a heavy heart that I announce I shall be leaving the groff mailing 
> list, I'm finding it too much work.

Deri --

This news just unmade my day.  The value of gropdf and your
contributions to mom have been immeasurable.  I will miss your voice
on the list.

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Dave Kemper
Deri,

I'm very sad to see your departure.  This list and the groff project
will sorely miss your knowledge, expertise, and coding skills.  I hope
you find a way to return soon.



Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Deri
Hi Branden,

It is with a heavy heart that I announce I shall be leaving the groff mailing 
list, I'm finding it too much work. Yesterday, you managed over 2900 words in 
approx 80 minutes whilst also doing a code review, and probably 3 other things 
as well! I am so jealous; my paltry sub-10 words a minute (on a good day) just 
can't cope, particularly as you sometimes have difficulty getting the points I 
am trying to make and reply with points which are not relevant. An example was 
your "unease" with adding an extra field to afmtodit output, you pointed me to 
some documentation rather than a swift perusal of the code in afmtodit where 
you can see that it ALWAYS outputs 5 tab characters and never outputs -- so no 
comments. And talking of "unease", you wrote, in reply to my request for help 
with merging: "Sure.  Once we're _both_ happy with it!  :D", and this was 
eight months ago, so unease really does mean rejection. In November it 
became:-

> > 1.  Changing the format of font description files to add yet
> > another field, mapping character names to Unicode code points.
> > In the rest of groff, this is not necessary because we have
> > glyphuni.cpp.
> >
> >
> > https://git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/glyphuni.cpp
> >
> > I'd like to honor the DRY principle here.  What's a good way
> > to achieve that?

Given that afmtodit does not use glyphuni.cpp (and can't) the DRY principle 
here means to let afmtodit plant the needed data in the font files for gropdf 
to use, but you didn't seem to see how irrelevant your comment was.

Anyway, enough of this useless banter. This is a joyful moment, I'm freeing up 
so much time to pursue other projects that will be equally rewarding as 
writing gropdf has been, like:-

Detection of bias in UK news channels

In the UK there is a legal obligation for "Due impartiality and due accuracy" 
(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code/section-five-due-impartiality-accuracy).
For the past 4 years, I have 
been converting the dvb-t subtitles for news channels into text using an OCR 
program I wrote. It's about time I fooled around with the data using NLP and 
see if it is possible to detect bias within the data, at a minimum I can 
extract statistics on the political persuasion of guests, but I've got a 
feeling I might be able to go further.

GB News, a right wing channel, keeps getting fined. I'd love to be able to 
write something which automatically emailed a complaint to Ofcom if it caught 
them breaking the rules, without having to watch the channel all day. :-)

My autobiography

Well, I've got the title - "A life more ordinary"!!

If I ever get the gropdf itch in the future, this is my todo list:-

A) Underlining text.

Peter asked if I could do this, ages ago because he has a method for 
postscript, from Tadziu. It is half done.

B) Watermarking

Given a PDF, scale it to full page size and place it under the groff output or, 
for a stamp, put it above it. I have worked out the last wrinkle. Normally, if you 
rotate the page with -P-l any pdfpic will be rotated as well, so that the 
picture orientation stays with the text orientation but the watermark 
orientation is controlled by the page orientation.

C) Ttf/otf in pdfs

This is a lot of work, but I was starting to get a handle on it. Incidentally, 
if I ever do get this done, the Tibet ligatures issue will be solved. The 
reason it seems to be Ok everywhere else except in groff, is because the 
"rules" for the ligature placement/resizing are in sub-tables within the ttf 
font file, but in the fontforge conversion to a pfa file most of this 
information is discarded because type 1 fonts have no concept of vertical 
adjustments so all that gets through is the horizontal adjustment which 
ensures the glyphs print over each other, but without the correct vertical 
adjustment/sizing. Still a lot of research to do.

I've just seen your last email with a lot of nice things, but sometimes you 
confuse "code review" with "design review". If someone wants to know how to 
get to the doctor it is not helpful to say "Well I would not start from here". 
I have told you right from the beginning that all I needed was a way to pass 
anything to gropdf, and so I coded on the expectation I could receive anything 
and dealt with it appropriately. This is all working code. Later you expressed 
a preference for a method where you would clean the data within troff so I 
would not need to, but I already had working code and so far any alternative 
is vapourware, and the only pseudo code I've seen (a for loop with a flag to 
indicate whether the next item is a node or a character), with the expectation 
that nodes will be discarded, would not cut the mustard because I believe 
special characters (i.e. \[u] or \[em]) are actually held as nodes within 
troff so would be discarded as not a character. So the criticism is of my 
design, hardly what I call a code review and 

Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-07 Thread G. Branden Robinson
[self-follow-up]

Hi Deri,

One more thing occurred to me, because your last paragraph was sticking
in my mind and I think I figured out why.

At 2024-02-06T19:30:58-0600, G. Branden Robinson wrote:
> > I am quite sure there will be "bugs" in my code, it is fairly
> > complex, but subjecting it to a "code review" without even running
> > it to see if it does what it says on the box, is not helpful.
> 
> I think you've pretty badly mistaken my perspective.  One of the
> reasons I stick my long nose into your code in this way is because I
> don't worry that you won't produce correct results.  You have an
> established record of delivering solutions that work as advertised.

That you put code review into scare quotes gave me a sort of belated
pause.  It finally dawned on me that you might be regarding my
undertaking of such on your contributions as a form of insult.

It emphatically is not!

Some computer science luminary--unfortunately I cannot remember who at
the moment--made the observation that programming languages chiefly
exist so that human beings can communicate to each other about
programming.  (Maybe someone reading recollects who I mean.)  If PLs
were intended _solely_ for consumption by machines, we'd stick with
machine language...or maybe assembly.

At the places I have worked, and at sites like GitHub and GitLab where
people manage things like pull requests and merge requests, it is not
only common for people other than the code author to undertake code
review before attempting to run it themselves, it is expected that they
won't!

Part of this is due to the cultural expectation that the author of code
will have tested it.  But another aspect is that humans are actually
pretty bad at inferring (perfect) correctness from inspection of source
code.  We are indeed likely to assume that it does what is on the box.
What code review is good for--and I think I said this recently on this
list, but maybe it was someplace else--is for programmers to share
expertise and problem-solving techniques with each other, and also to
reinforce the team mentality that sustains successful software projects
above the very small scale.

So I would ask that you please try to adopt that perspective when a
person perceptibly studies your, or anyone else's code.  Not all code
is worthy of study.  The famous Lions book presenting the Sixth Edition
Unix kernel was not an insult to Thompson and Ritchie, but a high form
of flattery...and today that book stands as a monument in the field of
operating systems research as an exposition of a successful,
high-quality system.

At the same time, everybody had gripes about the Unix kernel and some
aspects of how it was written, and even designed.  This is how we learn,
individually and collectively.

So, if I pay your code some scrutiny, it is not out of hauteur, but
respect.  I look at your code because I want to work with you.

I appreciate what you've contributed to groff and am pleased by how
well-received your efforts continue to be.

Best regards,
Branden



Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

At 2024-02-06T21:35:05+, Deri wrote:
> Many thanks for your thoughts on my code. I shall reply in general
> terms since your grasp of some of the issues is rather hazy, as you
> admit.

I generally don't feel I grasp code of nontrivial complexity until I've
documented it and written tests for it, and often not even then.  I'm a
bear of very little brain!

> Huge AGL lookup table
> 
> My least favourite solution, but you made me do it! The most elegant
> and efficient solution was to make a one line amendment to afmtodit
> which added an extra column to the groff font files which would have
> the UTF-16 code for that glyph. This would only affect devpdf and
> devps and I checked the library code groff uses to read its font files
> was not affected by an extra column. I also checked the buffer used
> would not cause an overflow. Despite this, you didn't like this
> solution, without giving a cogent reason,  but suggesting a lookup
> table!

I remember expressing unease with the "new column" approach, not
rejection.  The main reason is the documented format of the lines in
question.

groff_font(5):
 The directive charset starts the character set subsection.  (On
 typesetters, this directive is misnamed since it starts a list of
 glyphs, not characters.)  It precedes a series of glyph
 descriptions, one per line.  Each such glyph description comprises
 a set of fields separated by spaces or tabs and organized as
 follows.

name metrics type code [entity‐name] [-- comment]

[...]

 The entity‐name field defines an identifier for the glyph that the
 postprocessor uses to print the troff glyph name.  This field is
 optional; it was introduced so that the grohtml output driver could
 encode its character set.  For example, the glyph \[Po] is
 represented by “&pound;” in HTML 4.0.  For efficiency, these data
 are now compiled directly into grohtml.  grops uses the field to
 build sub‐encoding arrays for PostScript fonts containing more than
 256 glyphs.  Anything on the line after the entity‐name field or
 “--” is ignored.

The presence of 2 adjacent optional fields seems to me fairly close to
making the glyph descriptions formally undecidable.  In practice,
they're not, until and unless someone decides to name their "entity"
"--"...  (We don't actually tell anyone they're not allowed to do that.)
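
To make the ambiguity concrete, here is a minimal Python sketch of a parser for the documented line format. It is not groff's font-file reader: the function name, the returned dictionary and the sample metrics are illustrative assumptions. It follows the man page wording by treating a fifth field of "--" as the comment marker, which is exactly the case where an entity named "--" could not be represented.

```python
def parse_glyph_line(line):
    """Split a charset line: name metrics type code [entity-name] [-- comment]."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("need at least name, metrics, type and code")
    name, metrics, gtype, code = fields[:4]
    rest = fields[4:]
    entity = None
    comment = None
    if rest:
        if rest[0] == "--":
            # Ambiguous case: comment marker, or an entity literally named "--"?
            # This sketch assumes it is the comment marker.
            comment = " ".join(rest[1:])
        else:
            entity = rest[0]
            # Text after the entity name is kept only when "--" introduces it.
            if len(rest) > 1 and rest[1] == "--":
                comment = " ".join(rest[2:])
    return {"name": name, "metrics": metrics, "type": gtype,
            "code": code, "entity": entity, "comment": comment}


# Hypothetical sample lines; the metrics values are made up for illustration.
print(parse_glyph_line("Po\t556,691,14\t2\t163\tsterling"))
print(parse_glyph_line("Po\t556,691,14\t2\t163\t--\tpound sign"))
```

With only four mandatory fields, the parser has to guess what the fifth one is from its content alone, which is why tightening the format of that field would help.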

As I understand it, this feature is largely a consequence of the
implementation of grohtml 20-25 years ago, where an "entity" in HTML 4
and XHTML 1 was a well-defined thing.  We might do well to tighten the
semantics and format of this optional fifth field a bit more.

More esteemed *roffers than I have stumbled over our documentation's
unfortunate tendency to sometimes toss the term "entity" around,
unmoored from any formal definition in the *roff language.

https://lists.gnu.org/archive/html/groff/2023-04/msg2.html

While I'm complaining about hazy terminology that exacerbates my hazy
understanding of things, I'll observe that I don't understand what the
verb "to sub-encode" means.  I suspect there are better words to express
what this is trying to get across.  If I understood what grops was
actually doing here, I'd try to find those words.

> As to whether I should embed the table, or read it in, I deferred to
> the more efficient method used by afmtodit, embed it as part of make. I
> still would prefer the extra column solution, then there is no lookup
> at all.

I don't object to the idea, but I think our design decisions should be
documented, and it frequently seems to fall to me to undertake the
documentation.  That means I have to ask a lot of questions, which
programmers sometimes interpret as critique.  (And, to be fair,
sometimes I actually _do_ have critiques of an implementation.)

> use_charnames_in_special
> 
> Probably unnecessary once you complete the work to return .device to
> its 1.23.0 condition, as you have stated.

That seems like a fair prediction.  Almost all of the logic _on the
formatter side_ that employs this parameter seems to be in one function,
`encode_char()`.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427

(Last month, I renamed that to `encode_char_for_troff_output()` and I'm
thinking it can be further improved, more like
`encode_char_for_device_control()`...

...there's just one more thing.

There's one other occurrence, in a constructor.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293

I look forward to someday understanding what that's doing there.)

> pdfmomclean
> 
> Not quite sure how your work on #64484 will affect this, we will have
> to wait and see.

Fair enough.

> Stringhex
> 
> Clearly you are still misunderstanding the issue, because there are
> some incorrect statements.

Okay.

> In any lookup there is a key/value pair.

I'm with ya so far.

> If dealing with a document written in Japanese, both the key and 

Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread Deri
On Tuesday, 6 February 2024 14:45:59 GMT G. Branden Robinson wrote:
> Hi Deri,
> 
> Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
> There were two things I knew were still unmerged: the slanted symbol
> (SS) font support and the stringhex business, which has larger
> consequences than I understood at first.
> 
> At 2024-02-06T13:39:51+, Deri wrote:
> > The current gropdf (in the master branch) does support UTF-16BE
> > for pdf outlines (see attached pdf), but Branden has not released
> 
> At this point it's a merge (to the master branch), not a release, but
> true with that caveat.
> 
> So let me take a crack at a code review.

Hi Branden,

Many thanks for your thoughts on my code. I shall reply in general terms since 
your grasp of some of the issues is rather hazy, as you admit.

Huge AGL lookup table

My least favourite solution, but you made me do it! The most elegant and 
efficient solution was to make a one line amendment to afmtodit which added an 
extra column to the groff font files which would have the UTF-16 code for that 
glyph. This would only affect devpdf and devps and I checked the library code 
groff uses to read its font files was not affected by an extra column. I also 
checked the buffer used would not cause an overflow. Despite this, you didn't 
like this solution, without giving a cogent reason,  but suggesting a lookup 
table!

As to whether I should embed the table, or read it in, I deferred to the more 
efficient method used by afmtodit, embed it as part of make. I still would 
prefer the extra column solution, then there is no lookup at all.

use_charnames_in_special

Probably unnecessary once you complete the work to return .device to its 
1.23.0 condition, as you have stated.

pdfmomclean

Not quite sure how your work on #64484 will affect this, we will have to wait 
and see.

Stringhex

Clearly you are still misunderstanding the issue, because there are some 
incorrect statements.

In any lookup there is a key/value pair. If dealing with a document written in 
Japanese, both the key and the value will arrive as unicode. No problem for 
the value, but the key will be invalid if used as part of a register name. 
There are two obvious solutions. One is to encode the key into something, 
easily decoded, which is acceptable to be used as part of a register name, or 
do a loop traversal over two arrays, one holding the keys and one the values. 
I'm pretty sure my 9yr old grandson would come up with a looping solution. I 
really don't understand your opposition to the encoding solution, Ok, I accept 
you would have done it the child's way with the performance hit, but I prefer 
the more elegant encoding solution.
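
The two strategies can be compared in a small Python sketch. This is an illustration only, not the groff implementation: the hex-of-UTF-8 scheme is an assumption (the actual stringhex encoding may differ), and the helper names and the dictionary standing in for *roff string names are hypothetical.

```python
def hex_key(text):
    """Encode arbitrary text as an ASCII-only hex string (assumed scheme)."""
    return text.encode("utf-8").hex()


def unhex_key(encoded):
    """Invert hex_key()."""
    return bytes.fromhex(encoded).decode("utf-8")


# Strategy 1: the encoded key becomes part of a name; lookup is one dict access.
strings = {}


def define(key, value):
    strings["pdf:look(" + hex_key(key) + ")"] = value


def look_encoded(key):
    return strings.get("pdf:look(" + hex_key(key) + ")")


# Strategy 2: two parallel arrays searched in a loop; O(n) per lookup.
keys, values = [], []


def append(key, value):
    keys.append(key)
    values.append(value)


def look_linear(key):
    for k, v in zip(keys, values):
        if k == key:
            return v
    return None


define("序論", "1. 序論")
append("序論", "1. 序論")
print(hex_key("序論"))
print(look_encoded("序論") == look_linear("序論"))
```

Both lookups return the same value; the difference is that the encoded name contains only characters that are safe in an identifier, while the loop never needs to build a name from the key at all.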

Uniqueness of keys is an issue for either strategy. In mom, a user supplied 
key name is only possible by using the NAMED parameter, and if a user uses the 
same name twice in the document nothing nasty will happen, the overview panel 
will be correct, since each of those is tagged with a safe generated name, and 
if they have used the same name for two different places in the document, when 
they are checking all the intra-document links they will find one of them will 
go to the wrong place. Of course this could be improved by warning when the 
same name is provided for a different destination. The man/mdoc macros 
currently have no named destinations, all generated, but this will change if 
the mdoc section referencing is implemented.
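
The suggested warning could look roughly like the following Python sketch. It is not mom's macro code; the registry, the function name and the policy of keeping the first destination are assumptions made for illustration.

```python
import sys

# Maps a user-supplied name to the first destination registered under it.
destinations = {}


def register_name(name, destination):
    """Record a named destination; warn if the name is reused for another one."""
    previous = destinations.get(name)
    if previous is not None and previous != destination:
        print(f"warning: name '{name}' already refers to {previous}; "
              f"ignoring {destination}", file=sys.stderr)
        return False
    destinations[name] = destination
    return True


register_name("exercises", "page 3")
register_name("exercises", "page 9")  # emits a warning on stderr
```

Repeating the same name for the same destination stays silent, so only genuinely conflicting definitions are reported.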

You mention a possible issue if a diversion is passed to stringhex, since this 
is 95% your own code for stringup/down, I'm pretty sure that whatever you do 
to solve the issue in your own code can be equally applied to stringhex, so 
this is not an argument you can use to prevent its inclusion.

As regards your point 2, this is a non-issue, in 1.23.0 it works fine with 
.device. You ask what does:-

\X'pdf: bizzarecmd \[u1234]'

Mean? Well, assuming you are writing in the ethiopic language and wrote:-

\X'pdf: bizzarecmd ሴ'

And gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC 
SYLLABLE SEE). Which could be setting a window title in the pdf viewer, I'm 
not sure, I have not written a handler for bizzarecmd. As you can see not 
"misleading to a novice" at all, the fact that preconv changed it to be a 
different form and gropdf changed it back to a character to use in pdf meta-
data is completely transparent to the user.
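
The round trip described above can be sketched in Python. This is not gropdf's Perl code: the regular expression and function names are assumptions, and the sketch only handles the \[uXXXX] form (including underscore-joined composites). The byte order mark FE FF in front of UTF-16BE text is how PDF marks a Unicode text string.

```python
import re

# Matches \[uXXXX] and composites such as \[u0041_0301] (uppercase hex digits).
UNICODE_ESCAPE = re.compile(r"\\\[u([0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*)\]")


def decode_escapes(text):
    """Replace Unicode special character escapes with the characters they name."""
    def to_chars(match):
        return "".join(chr(int(cp, 16)) for cp in match.group(1).split("_"))
    return UNICODE_ESCAPE.sub(to_chars, text)


def pdf_text_string(text):
    """Encode text as UTF-16BE, prefixed with the byte order mark."""
    return b"\xfe\xff" + text.encode("utf-16-be")


arg = decode_escapes(r"\[u1234]")
print(pdf_text_string(arg).hex())  # prints feff1234
```

The user types the character, the escape form exists only between preconv and the output driver, and the driver recovers the character before writing the metadata.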

Your work on \X and .device is to put .device back to how it was in 1.23.0 and 
alter \X to be the same, this is what you said would happen.

The purpose of my patch was intended to give Robin a robust solution to what 
he wanted to do.

You wrote in another email:-

"But tparm(const char *str, long, long, long, long, long, long, long,
long, long) is one of the worst things I've ever seen in C code.

As I just got done saying (more or less) to Deri, when you have to
obfuscate your inputs to cram them into the data structure you're using,
that's a sign that you're using 

gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
There were two things I knew were still unmerged: the slanted symbol
(SS) font support and the stringhex business, which has larger
consequences than I understood at first.

At 2024-02-06T13:39:51+, Deri wrote:
> The current gropdf (in the master branch) does support UTF-16BE
> for pdf outlines (see attached pdf), but Branden has not released

At this point it's a merge (to the master branch), not a release, but
true with that caveat.

So let me take a crack at a code review.

diff --git a/contrib/mom/om.tmac b/contrib/mom/om.tmac
index d3b5002a8..87d9ba3cb 100644
--- a/contrib/mom/om.tmac
+++ b/contrib/mom/om.tmac
@@ -4906,7 +4906,7 @@ y\R'#DESCENDER \\n[.cdp]'
 .ds $AUTHOR \\*[$AUTHOR_1]
 .substring $AUTHORS 0 -2
 .ds PDF_AUTHORS \\*[$AUTHORS]
-.pdfmomclean PDF_AUTHORS
+.if '\\*[.T]'ps' .pdfmomclean PDF_AUTHORS
 .nop \!x X ps:exec [/Author (\\*[PDF_AUTHORS]) /DOCINFO pdfmark
 .END
 .
@@ -23512,13 +23512,13 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .  el .nr LEVEL_REQ \\n[CURRENT_LEVEL]
 .   \}
 .   ds PDF_TX \\$*
-.   pdfmomclean PDF_TX
 .   nr PDF_LEV (\\n[LEVEL_REQ]*\\n[#PDF_BOOKMARKS_OPEN])
 .   ie '\\*[.T]'ps' \{\
 .   if !'\\*[PDF_NM]'' \{\
 .  pdfhref M -N \\*[PDF_NM2] -- \\*[PDF_TX]
 .  if !dpdf:href.map .tm gropdf-info:href \\*[PDF_NM2] \\*[PDF_TX]
 .   \}
+.   pdfmomclean PDF_TX
 .   pdfbookmark \\n[PDF_LEV] \\*[PDF_TX]
 .   \}
 .   el .pdfbookmark \\*[PDF_NM] \\n[PDF_LEV] \\$*
@@ -23539,7 +23539,7 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 \#
 .MAC PDF_TITLE END
 .ds pdftitle \\$*
-.pdfmomclean pdftitle
+.if '\\*[.T]'ps' .pdfmomclean pdftitle
 .nop \!x X ps:exec [/Title (\\*[pdftitle]) /DOCINFO pdfmark
 .END
 \#

I hope to have made "pdfmomclean" unnecessary with my revised fix for
Savannah #64484.[1]  Or at least enabled it to be shorter and simpler.

@@ -23612,8 +23612,10 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .if '\\*[PDF_AST]'*' \{\
 .chop PDF_TXT
 .ie '\\*[.T]'pdf' \{\
-.   ie d pdf:look(\\*[PDF_NM]) \
-.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM])]\\*[PDF_AST_Q]
+.   ds PDF_NM_HEX \\*[PDF_NM]
+.   stringhex PDF_NM_HEX
+.   ie d pdf:look(\\*[PDF_NM_HEX]) \
+.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM_HEX])]\\*[PDF_AST_Q]
 .   el \{\
 .   as PDF_TXT Unknown
 .   if !rPDF_UNKNOWN .tm \

In our discussions, significant confusion (mostly mine, I guess) has
surrounded "stringhex".  There are two distinct problems, as I
understand it.

1.  To date in groff development, PDF bookmarks generally get named
using the text they're associated with.  In macro packages there is
a tendency to keep track of the bookmarks by defining strings for
them.  That's not a problem.  The problem is that the associated
text is made _part of the string identifier_ (probably because this
[a] is easy to implement and [b] results in O(1) lookup time given
the way the formatter manages *roff strings).  The trouble is that a
special character escape sequence is not valid in a *roff
identifier.

So while "pdf:look(ABC)" is a valid identifier,
"pdf:look(\[*a]\[*b]\[*c])" is not, and Unicode special character
escape sequences are no different.

In my opinion, it is not a good design to encode the bookmark text
directly into the name of the *roff identifier like this.

A.  We have the problem above.

B.  How do you ensure uniqueness of these strings?  What if I have
multiple places in a document titled, say "Exercises"?

An alternative approach would be to store the bookmark IDs in
strings indexed by a serial number.  A *roff autoincrementing
register is an obvious mechanism for doing this.  When you need to
look up the bookmark, you call a macro that searches the collected
IDs until a match is found.

Back-of-napkin sketch:

.\" Search for bookmark text matching $1.  Store the matching bookmark
.\" number in pdf*found if found, otherwise -1 for failure.
.de pdf:look
.  nr pdf*found -1
.  nr pdf*search-index 0 1
.  while (\\n+[pdf*search-index] < \\n[pdf*max-index]) \{\
.    if '\\*[pdf*bookmark!\\n[pdf*search-index]]'\\$1' \{\
.      nr pdf*found \\n[pdf*search-index]
.      break
.    \}
.  \}
..

Yes, this is an O(n) search and yes, we still have the uniqueness
problem.

Still another approach is to hash the bookmark identifier in some
way, and that is more or less what you're doing with `stringhex`.
(Strictly, it's not a hash, but an encoding.)  This is back to O(1)
lookup time, which is good, but I regret the