Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))
On Wed, Feb 07, 2024, Deri wrote:
> It is with a heavy heart that I announce I shall be leaving the groff
> mailing list, I'm finding it too much work.

Deri --

This news just unmade my day. The value of gropdf and your contributions to mom have been immeasurable. I will miss your voice on the list.

--
Peter Schaffter
https://www.schaffter.ca
Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))
Deri, I'm very sad to see your departure. This list and the groff project will sorely miss your knowledge, expertise, and coding skills. I hope you find a way to return soon.
Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))
Hi Branden,

It is with a heavy heart that I announce I shall be leaving the groff mailing list; I'm finding it too much work. Yesterday, you managed over 2900 words in approx 80 minutes whilst also doing a code review, and probably 3 other things as well! I am so jealous; my paltry sub-10 words a minute (on a good day) just can't cope, particularly as you sometimes have difficulty getting the points I am trying to make and reply with points which are not relevant. An example was your "unease" with adding an extra field to afmtodit output: you pointed me to some documentation rather than to a swift perusal of the code in afmtodit, where you can see that it ALWAYS outputs 5 tab characters and never outputs "--", so no comments.

And talking of "unease", you wrote, in reply to my request for help with merging: "Sure. Once we're _both_ happy with it! :D", and this was eight months ago, so unease really does mean rejection. In November it became:-

> > 1. Changing the format of font description files to add yet
> >    another field, mapping character names to Unicode code points.
> >    In the rest of groff, this is not necessary because we have
> >    glyphuni.cpp.
> >
> >    https://git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/glyphuni.cpp
> >
> >    I'd like to honor the DRY principle here. What's a good way
> >    to achieve that?

Given that afmtodit does not use glyphuni.cpp (and can't), the DRY principle here means letting afmtodit plant the needed data in the font files for gropdf to use, but you didn't seem to see how irrelevant your comment was.

Anyway, enough of this useless banter. This is a joyful moment: I'm freeing up so much time to pursue other projects that will be equally rewarding as writing gropdf has been, like:-

Detection of bias in UK news channels

In the UK there is a legal obligation for "Due impartiality and due accuracy" (https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code/section-five-due-impartiality-accuracy).
For the past 4 years, I have been converting the DVB-T subtitles for news channels into text using an OCR program I wrote. It's about time I fooled around with the data using NLP to see if it is possible to detect bias within it; at a minimum I can extract statistics on the political persuasion of guests, but I've got a feeling I might be able to go further. GB News, a right-wing channel, keeps getting fined. I'd love to be able to write something which automatically emailed a complaint to Ofcom if it caught them breaking the rules, without my having to watch the channel all day. :-)

My autobiography

Well, I've got the title - "A life more ordinary"!!

If I ever get the gropdf itch in the future, this is my todo list:-

A) Underlining text. Peter asked if I could do this ages ago, because he has a method for PostScript, from Tadziu. It is half done.

B) Watermarking. Given a PDF, scale it to full page size and place it under the groff output, or, for a stamp, put it above. I have worked out the last wrinkle: normally, if you rotate the page with -P-l, any pdfpic will be rotated as well, so that the picture orientation stays with the text orientation, but the watermark orientation is controlled by the page orientation.

C) Ttf/otf in PDFs. This is a lot of work, but I was starting to get a handle on it. Incidentally, if I ever do get this done, the Tibetan ligatures issue will be solved. The reason it seems to be OK everywhere else except in groff is that the "rules" for ligature placement/resizing are in sub-tables within the TTF font file, but in the fontforge conversion to a PFA file most of this information is discarded, because Type 1 fonts have no concept of vertical adjustments; all that gets through is the horizontal adjustment, which ensures the glyphs print over each other, but without the correct vertical adjustment/sizing. Still a lot of research to do.
I've just seen your last email with a lot of nice things, but sometimes you confuse "code review" with "design review". If someone wants to know how to get to the doctor, it is not helpful to say "Well, I would not start from here". I have told you right from the beginning that all I needed was a way to pass anything to gropdf, and so I coded on the expectation that I could receive anything, and dealt with it appropriately. This is all working code. Later you expressed a preference for a method where you would clean the data within troff so I would not need to, but I already had working code, and so far any alternative is vapourware. The only pseudo-code I've seen (a for loop with a flag to indicate whether the next item is a node or a character), with the expectation that nodes will be discarded, would not cut the mustard, because I believe special characters (i.e. \[u] or \[em]) are actually held as nodes within troff, so would be discarded as not a character. So the criticism is of my design, hardly what I call a code review and
Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)
[self-follow-up]

Hi Deri,

One more thing occurred to me, because your last paragraph was sticking in my mind and I think I figured out why.

At 2024-02-06T19:30:58-0600, G. Branden Robinson wrote:
> > I am quite sure there will be "bugs" in my code, it is fairly
> > complex, but subjecting it to a "code review" without even running
> > it to see if it does what it says on the box, is not helpful.
>
> I think you've pretty badly mistaken my perspective. One of the
> reasons I stick my long nose into your code in this way is because I
> don't worry that you won't produce correct results. You have an
> established record of delivering solutions that work as advertised.

That you put code review into scare quotes gave me a sort of belated pause. It finally dawned on me that you might be regarding my undertaking of such on your contributions as a form of insult. It emphatically is not!

Some computer science luminary--unfortunately I cannot remember who at the moment--made the observation that programming languages chiefly exist so that human beings can communicate with each other about programming. (Maybe someone reading recollects who I mean.) If PLs were intended _solely_ for consumption by machines, we'd stick with machine language...or maybe assembly.

At the places I have worked, and at sites like GitHub and GitLab where people manage things like pull requests and merge requests, it is not only common for people other than the code author to undertake code review before attempting to run it themselves, it is expected that they won't run it! Part of this is due to the cultural expectation that the author of code will have tested it. But another aspect is that humans are actually pretty bad at inferring (perfect) correctness from inspection of source code. We are indeed likely to assume that it does what it says on the box.
What code review is good for--and I think I said this recently on this list, but maybe it was someplace else--is for programmers to share expertise and problem-solving techniques with each other, and also to reinforce the team mentality that sustains successful software projects above the very small scale. So I would ask that you please try to adopt that perspective when a person perceptibly studies your, or anyone else's, code.

Not all code is worthy of study. The famous Lions book presenting the Sixth Edition Unix kernel was not an insult to Thompson and Ritchie, but a high form of flattery...and today that book stands as a monument in the field of operating systems research as an exposition of a successful, high-quality system. At the same time, everybody had gripes about the Unix kernel and some aspects of how it was written, and even designed. This is how we learn, individually and collectively.

So, if I pay your code some scrutiny, it is not out of hauteur, but respect. I look at your code because I want to work with you. I appreciate what you've contributed to groff and am pleased by how well-received your efforts continue to be.

Best regards,
Branden
Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)
Hi Deri,

At 2024-02-06T21:35:05+, Deri wrote:
> Many thanks for your thoughts on my code. I shall reply in general
> terms since your grasp of some of the issues is rather hazy, as you
> admit.

I generally don't feel I grasp code of nontrivial complexity until I've documented it and written tests for it, and often not even then. I'm a bear of very little brain!

> Huge AGL lookup table
>
> My least favourite solution, but you made me do it! The most elegant
> and efficient solution was to make a one line amendment to afmtodit
> which added an extra column to the groff font files which would have
> the UTF-16 code for that glyph. This would only affect devpdf and
> devps and I checked the library code groff uses to read its font files
> was not affected by an extra column. I also checked the buffer used
> would not cause an overflow. Despite this, you didn't like this
> solution, without giving a cogent reason, but suggesting a lookup
> table!

I remember expressing unease with the "new column" approach, not rejection. The main reason is the documented format of the lines in question. groff_font(5):

    The directive charset starts the character set subsection. (On
    typesetters, this directive is misnamed since it starts a list of
    glyphs, not characters.) It precedes a series of glyph
    descriptions, one per line. Each such glyph description comprises
    a set of fields separated by spaces or tabs and organized as
    follows.

        name metrics type code [entity-name] [-- comment]

    [...]

    The entity-name field defines an identifier for the glyph that
    the postprocessor uses to print the troff glyph name. This field
    is optional; it was introduced so that the grohtml output driver
    could encode its character set. For example, the glyph \[Po] is
    represented by "&pound;" in HTML 4.0. For efficiency, these data
    are now compiled directly into grohtml. grops uses the field to
    build sub-encoding arrays for PostScript fonts containing more
    than 256 glyphs.
    Anything on the line after the entity-name field or "--" is
    ignored.

The presence of two adjacent optional fields seems to me fairly close to making the glyph descriptions formally undecidable. In practice, they're not, until and unless someone decides to name their "entity" "--"... (We don't actually tell anyone they're not allowed to do that.) As I understand it, this feature is largely a consequence of the implementation of grohtml 20-25 years ago, when an "entity" in HTML 4 and XHTML 1 was a well-defined thing. We might do well to tighten the semantics and format of this optional fifth field a bit more. More esteemed *roffers than I have stumbled over our documentation's unfortunate tendency to sometimes toss the term "entity" around, unmoored from any formal definition in the *roff language.

https://lists.gnu.org/archive/html/groff/2023-04/msg2.html

While I'm complaining about hazy terminology that exacerbates my hazy understanding of things, I'll observe that I don't understand what the verb "to sub-encode" means. I suspect there are better words to express what this is trying to get across. If I understood what grops was actually doing here, I'd try to find those words.

> As to whether I should embed the table, or read it in, I deferred to
> the more efficient method used by afmtodit, embed it as part of make.
> I still would prefer the extra column solution, then there is no
> lookup at all.

I don't object to the idea, but I think our design decisions should be documented, and it frequently seems to fall to me to undertake the documentation. That means I have to ask a lot of questions, which programmers sometimes interpret as critique. (And, to be fair, sometimes I actually _do_ have critiques of an implementation.)

> use_charnames_in_special
>
> Probably unnecessary once you complete the work to return .device to
> its 1.23.0 condition, as you have stated.

That seems like a fair prediction.
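To make that ambiguity concrete, here is a hypothetical charset line (the metrics and code values are invented for illustration, not taken from any shipped groff font) carrying all five fields plus a comment:

```
Po	556,729	2	0243	pound	-- POUND SIGN
```

If someone perversely named their "entity" "--", a parser scanning left to right could not tell whether the fifth field is an entity name or the start of a comment, which is exactly the decidability worry above.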
Almost all of the logic _on the formatter side_ that employs this parameter seems to be in one function, `encode_char()`.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427

(Last month, I renamed that to `encode_char_for_troff_output()`, and I'm thinking it can be further improved, more like `encode_char_for_device_control()`...

...there's just one more thing. There's one other occurrence, in a constructor.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293

I look forward to someday understanding what that's doing there.)

> pdfmomclean
>
> Not quite sure how your work on #64484 will affect this, we will have
> to wait and see.

Fair enough.

> Stringhex
>
> Clearly you are still misunderstanding the issue, because there are
> some incorrect statements.

Okay.

> In any lookup there is a key/value pair.

I'm with ya so far.

> If dealing with a document written in Japanese, both the key and
Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)
On Tuesday, 6 February 2024 14:45:59 GMT G. Branden Robinson wrote:
> Hi Deri,
>
> Now _does_ seem to be a good time to catch up on gropdf-ng merge
> status. There were two things I knew were still unmerged: the slanted
> symbol (SS) font support and the stringhex business, which has larger
> consequences than I understood at first.
>
> At 2024-02-06T13:39:51+, Deri wrote:
> > The current gropdf (in the master branch) does support UTF-16BE
> > for pdf outlines (see attached pdf), but Branden has not released
>
> At this point it's a merge (to the master branch), not a release, but
> true with that caveat.
>
> So let me take a crack at a code review.

Hi Branden,

Many thanks for your thoughts on my code. I shall reply in general terms, since your grasp of some of the issues is rather hazy, as you admit.

Huge AGL lookup table

My least favourite solution, but you made me do it! The most elegant and efficient solution was to make a one-line amendment to afmtodit which added an extra column to the groff font files, which would have the UTF-16 code for that glyph. This would only affect devpdf and devps, and I checked that the library code groff uses to read its font files was not affected by an extra column. I also checked that the buffer used would not cause an overflow. Despite this, you didn't like this solution, without giving a cogent reason, but suggested a lookup table!

As to whether I should embed the table or read it in, I deferred to the more efficient method used by afmtodit: embed it as part of make. I still would prefer the extra-column solution; then there is no lookup at all.

use_charnames_in_special

Probably unnecessary once you complete the work to return .device to its 1.23.0 condition, as you have stated.

pdfmomclean

Not quite sure how your work on #64484 will affect this; we will have to wait and see.

Stringhex

Clearly you are still misunderstanding the issue, because there are some incorrect statements. In any lookup there is a key/value pair.
If dealing with a document written in Japanese, both the key and the value will arrive as Unicode. No problem for the value, but the key will be invalid if used as part of a register name. There are two obvious solutions: one is to encode the key into something easily decoded which is acceptable as part of a register name, the other is to do a loop traversal over two arrays, one holding the keys and one the values. I'm pretty sure my 9-year-old grandson would come up with a looping solution. I really don't understand your opposition to the encoding solution. OK, I accept you would have done it the child's way, with the performance hit, but I prefer the more elegant encoding solution.

Uniqueness of keys is an issue for either strategy. In mom, a user-supplied key name is only possible by using the NAMED parameter, and if a user uses the same name twice in the document nothing nasty will happen; the overview panel will be correct, since each of its entries is tagged with a safe generated name, and if they have used the same name for two different places in the document, when they are checking all the intra-document links they will find one of them goes to the wrong place. Of course this could be improved by warning when the same name is provided for a different destination. The man/mdoc macros currently have no named destinations, all generated, but this will change if the mdoc section referencing is implemented.

You mention a possible issue if a diversion is passed to stringhex. Since this is 95% your own code for stringup/down, I'm pretty sure that whatever you do to solve the issue in your own code can be equally applied to stringhex, so this is not an argument you can use to prevent its inclusion. As regards your point 2, this is a non-issue; in 1.23.0 it works fine with .device.

You ask: what does

    \X'pdf: bizzarecmd \[u1234]'

mean?
Well, assuming you are writing in Ethiopic and wrote:

    \X'pdf: bizzarecmd ሴ'

gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC SYLLABLE SEE), which could be setting a window title in the pdf viewer; I'm not sure, I have not written a handler for bizzarecmd. As you can see, not "misleading to a novice" at all; the fact that preconv changed it to a different form and gropdf changed it back to a character to use in PDF metadata is completely transparent to the user.

Your work on \X and .device is to put .device back to how it was in 1.23.0 and alter \X to be the same; this is what you said would happen. The purpose of my patch was to give Robin a robust solution to what he wanted to do.

You wrote in another email:

> But tparm(const char *str, long, long, long, long, long, long, long,
> long, long) is one of the worst things I've ever seen in C code. As I
> just got done saying (more or less) to Deri, when you have to
> obfuscate your inputs to cram them into the data structure you're
> using, that's a sign that you're using
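For readers following along, the trick being argued over here can be sketched in a few lines of Python (function names are illustrative only, not part of groff or gropdf): encode an arbitrary Unicode key into hex digits, which are always safe inside a *roff register or string name, and decode it back when the original text is needed.

```python
def encode_key(text: str) -> str:
    """Encode an arbitrary Unicode key as hex digits, which are
    acceptable inside a *roff identifier such as pdf:look(...)."""
    return text.encode("utf-8").hex()

def decode_key(hexkey: str) -> str:
    """Recover the original Unicode key from its hex form."""
    return bytes.fromhex(hexkey).decode("utf-8")

# U+1234 ETHIOPIC SYLLABLE SEE survives the round trip unchanged:
assert decode_key(encode_key("ሴ")) == "ሴ"
```

The encoding is trivially reversible, which is why it is an encoding rather than a hash: no collisions are introduced beyond those already present in the key text itself.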
gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)
Hi Deri,

Now _does_ seem to be a good time to catch up on gropdf-ng merge status. There were two things I knew were still unmerged: the slanted symbol (SS) font support and the stringhex business, which has larger consequences than I understood at first.

At 2024-02-06T13:39:51+, Deri wrote:
> The current gropdf (in the master branch) does support UTF-16BE
> for pdf outlines (see attached pdf), but Branden has not released

At this point it's a merge (to the master branch), not a release, but true with that caveat.

So let me take a crack at a code review.

diff --git a/contrib/mom/om.tmac b/contrib/mom/om.tmac
index d3b5002a8..87d9ba3cb 100644
--- a/contrib/mom/om.tmac
+++ b/contrib/mom/om.tmac
@@ -4906,7 +4906,7 @@ y\R'#DESCENDER \\n[.cdp]'
 .ds $AUTHOR \\*[$AUTHOR_1]
 .substring $AUTHORS 0 -2
 .ds PDF_AUTHORS \\*[$AUTHORS]
-.pdfmomclean PDF_AUTHORS
+.if '\\*[.T]'ps' .pdfmomclean PDF_AUTHORS
 .nop \!x X ps:exec [/Author (\\*[PDF_AUTHORS]) /DOCINFO pdfmark
 .END
 .
@@ -23512,13 +23512,13 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 . el .nr LEVEL_REQ \\n[CURRENT_LEVEL]
 . \}
 . ds PDF_TX \\$*
-. pdfmomclean PDF_TX
 . nr PDF_LEV (\\n[LEVEL_REQ]*\\n[#PDF_BOOKMARKS_OPEN])
 . ie '\\*[.T]'ps' \{\
 . if !'\\*[PDF_NM]'' \{\
 . pdfhref M -N \\*[PDF_NM2] -- \\*[PDF_TX]
 . if !dpdf:href.map .tm gropdf-info:href \\*[PDF_NM2] \\*[PDF_TX]
 . \}
+. pdfmomclean PDF_TX
 . pdfbookmark \\n[PDF_LEV] \\*[PDF_TX]
 . \}
 . el .pdfbookmark \\*[PDF_NM] \\n[PDF_LEV] \\$*
@@ -23539,7 +23539,7 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 \#
 .MAC PDF_TITLE END
 .ds pdftitle \\$*
-.pdfmomclean pdftitle
+.if '\\*[.T]'ps' .pdfmomclean pdftitle
 .nop \!x X ps:exec [/Title (\\*[pdftitle]) /DOCINFO pdfmark
 .END
 \#

I hope to have made "pdfmomclean" unnecessary with my revised fix for Savannah #64484.[1] Or at least to have enabled it to be shorter and simpler.

@@ -23612,8 +23612,10 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .if '\\*[PDF_AST]'*' \{\
 .chop PDF_TXT
 .ie '\\*[.T]'pdf' \{\
-. ie d pdf:look(\\*[PDF_NM]) \
-. as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM])]\\*[PDF_AST_Q]
+. ds PDF_NM_HEX \\*[PDF_NM]
+. stringhex PDF_NM_HEX
+. ie d pdf:look(\\*[PDF_NM_HEX]) \
+. as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM_HEX])]\\*[PDF_AST_Q]
 . el \{\
 . as PDF_TXT Unknown
 . if !rPDF_UNKNOWN .tm \

In our discussions, significant confusion (mostly mine, I guess) has surrounded "stringhex". There are two distinct problems, as I understand it.

1. To date in groff development, PDF bookmarks generally get named using the text they're associated with. In macro packages there is a tendency to keep track of the bookmarks by defining strings for them. That's not a problem. The problem is that the associated text is made _part of the string identifier_ (probably because this [a] is easy to implement and [b] results in O(1) lookup time given the way the formatter manages *roff strings). The trouble is that a special character escape sequence is not valid in a *roff identifier. So while "pdf:look(ABC)" is a valid identifier, "pdf:look(\[*a]\[*b]\[*c])" is not, and Unicode special character escape sequences are no different.

In my opinion, it is not a good design to encode the bookmark text directly into the name of the *roff identifier like this.

A. We have the problem above.

B. How do you ensure uniqueness of these strings? What if I have multiple places in a document titled, say, "Exercises"?

An alternative approach would be to store the bookmark IDs in strings indexed by a serial number. A *roff auto-incrementing register is an obvious mechanism for doing this. When you need to look up the bookmark, you call a macro that searches the collected IDs until a match is found. Back-of-napkin sketch:

.\" Search for bookmark text matching $1.  Record the matching bookmark
.\" number in pdf*found if found, otherwise -1 for failure.
.de pdf:look
.  nr pdf*found -1
.  nr pdf*search-index 0 1
.  while (\\n+[pdf*search-index] < \\n[pdf*max-index]) \{\
.    if '\\*[pdf*bookmark!\\n[pdf*search-index]]'\\$1' \{\
.      nr pdf*found \\n[pdf*search-index]
.      break
.    \}
.  \}
..

Yes, this is an O(n) search, and yes, we still have the uniqueness problem.

Still another approach is to hash the bookmark identifier in some way, and that is more or less what you're doing with `stringhex`. (Strictly, it's not a hash, but an encoding.) This is back to O(1) lookup time, which is good, but I regret the
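The trade-off being weighed here can be illustrated with a small Python sketch (names invented for illustration, not groff code): the serial-number scheme costs an O(n) scan, while keying a table by the encoded text restores O(1) lookup. Note that neither resolves the uniqueness problem; with duplicate titles they merely pick different winners.

```python
# Bookmark texts in document order; "Exercises" appears twice on purpose.
bookmarks = ["Introduction", "Exercises", "Summary", "Exercises"]

def find_linear(title: str) -> int:
    """O(n) scan over serially numbered bookmarks, analogous to the
    pdf:look macro sketch; the FIRST duplicate wins."""
    for i, text in enumerate(bookmarks):
        if text == title:
            return i
    return -1

# O(1) alternative: key a table by the hex-encoded (identifier-safe) text.
index = {text.encode("utf-8").hex(): i for i, text in enumerate(bookmarks)}

def find_encoded(title: str) -> int:
    """O(1) lookup on the encoded key; the LAST duplicate wins,
    because later insertions overwrite earlier ones."""
    return index.get(title.encode("utf-8").hex(), -1)

print(find_linear("Exercises"))   # 1
print(find_encoded("Exercises"))  # 3
```

Either way, a document with two sections titled "Exercises" silently resolves to only one of them, which is the uniqueness wrinkle both messages circle around.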