Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

At 2024-02-06T21:35:05+, Deri wrote:
> Many thanks for your thoughts on my code. I shall reply in general
> terms since your grasp of some of the issues is rather hazy, as you
> admit.

I generally don't feel I grasp code of nontrivial complexity until I've
documented it and written tests for it, and often not even then.  I'm a
bear of very little brain!

> Huge AGL lookup table
> 
> My least favourite solution, but you made me do it! The most elegant
> and efficient solution was to make a one line amendment to afmtodit
> which added an extra column to the groff font files which would have
> the UTF-16 code for that glyph. This would only affect devpdf and
> devps and I checked the library code groff uses to read its font files
> was not affected by an extra column. I also checked the buffer used
> would not cause an overflow. Despite this, you didn't like this
> solution, without giving a cogent reason,  but suggesting a lookup
> table!

I remember expressing unease with the "new column" approach, not
rejection.  The main reason is the documented format of the lines in
question.

groff_font(5):
 The directive charset starts the character set subsection.  (On
 typesetters, this directive is misnamed since it starts a list of
 glyphs, not characters.)  It precedes a series of glyph
 descriptions, one per line.  Each such glyph description comprises
 a set of fields separated by spaces or tabs and organized as
 follows.

name metrics type code [entity‐name] [-- comment]

[...]

 The entity‐name field defines an identifier for the glyph that the
 postprocessor uses to print the troff glyph name.  This field is
 optional; it was introduced so that the grohtml output driver could
 encode its character set.  For example, the glyph \[Po] is
 represented by “&pound;” in HTML 4.0.  For efficiency, these data
 are now compiled directly into grohtml.  grops uses the field to
 build sub‐encoding arrays for PostScript fonts containing more than
 256 glyphs.  Anything on the line after the entity‐name field or
 “--” is ignored.

The presence of 2 adjacent optional fields seems to me fairly close to
making the glyph descriptions formally undecidable.  In practice,
they're not, until and unless someone decides to name their "entity"
"--"...  (We don't actually tell anyone they're not allowed to do that.)

As I understand it, this feature is largely a consequence of the
implementation of grohtml 20-25 years ago, where an "entity" in HTML 4
and XHTML 1 was a well-defined thing.  We might do well to tighten the
semantics and format of this optional fifth field a bit more.
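
A minimal sketch, in Python, of how a reader of the quoted
"name metrics type code [entity-name] [-- comment]" format might split a
glyph description.  The function name, the sample line, and its metric
values are made up for illustration; this is not how groff's font
library parses these lines.  It only shows why an entity named "--"
cannot be told apart from the start of a comment.

```python
def parse_glyph_line(line):
    """Split one charset line: name metrics type code [entity-name] [-- comment]."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("need at least name, metrics, type, and code")
    name, metrics, glyph_type, code = fields[:4]
    rest = fields[4:]
    entity = None
    comment = None
    if rest and rest[0] != "--":
        # The first extra field is taken as the entity name.
        entity = rest[0]
        rest = rest[1:]
    if rest and rest[0] == "--":
        # Everything after "--" is comment text.
        comment = " ".join(rest[1:])
    return {"name": name, "metrics": metrics, "type": glyph_type,
            "code": code, "entity": entity, "comment": comment}


# Hypothetical sample lines; the metric values are invented.
plain = parse_glyph_line("Po 500,700 2 163 sterling -- POUND SIGN")
no_entity = parse_glyph_line("Po 500,700 2 163 -- POUND SIGN")
# An entity literally named "--" is swallowed as the comment marker.
odd = parse_glyph_line("Po 500,700 2 163 -- -- POUND SIGN")
```

In the last case the parser has no way to recover the intended entity
name; it reports no entity and a comment that begins with "--".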

More esteemed *roffers than I have stumbled over our documentation's
unfortunate tendency to sometimes toss the term "entity" around,
unmoored from any formal definition in the *roff language.

https://lists.gnu.org/archive/html/groff/2023-04/msg2.html

While I'm complaining about hazy terminology that exacerbates my hazy
understanding of things, I'll observe that I don't understand what the
verb "to sub-encode" means.  I suspect there are better words to express
what this is trying to get across.  If I understood what grops was
actually doing here, I'd try to find those words.

> As to whether I should embed the table, or read it in, I deferred to
> the more efficient method used by afmtodit, embed it as part of make. I
> still would prefer the extra column solution, then there is no lookup
> at all.

I don't object to the idea, but I think our design decisions should be
documented, and it frequently seems to fall to me to undertake the
documentation.  That means I have to ask a lot of questions, which
programmers sometimes interpret as critique.  (And, to be fair,
sometimes I actually _do_ have critiques of an implementation.)

> use_charnames_in_special
> 
> Probably unnecessary once you complete the work to return .device to
> its 1.23.0 condition, as you have stated.

That seems like a fair prediction.  Almost all of the logic _on the
formatter side_ that employs this parameter seems to be in one function,
`encode_char()`.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427

(Last month, I renamed that to `encode_char_for_troff_output()` and I'm
thinking it can be further improved, more like
`encode_char_for_device_control()`...

...there's just one more thing.

There's one other occurrence, in a constructor.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293

I look forward to someday understanding what that's doing there.)

> pdfmomclean
> 
> Not quite sure how your work on #64484 will affect this, we will have
> to wait and see.

Fair enough.

> Stringhex
> 
> Clearly you are still misunderstanding the issue, because there are
> some incorrect statements.

Okay.

> In any lookup there is a key/value pair.

I'm with ya so far.

> If dealing with a document written in Japanese, both the key and 

Re: Re: PDF outline not capturing Cyrillic text

2024-02-06 Thread Robin Haberkorn
On Tue, Feb 06, 2024 at 01:39:51PM +, Deri wrote:
> Hi Robin,
> 
> The current gropdf (in the master branch) does support UTF-16BE for pdf 
> outlines (see attached pdf), but Branden has not released the other parts to 
> make it work! If you can compile and install the current git then applying the 
> attached patch should give you what you want.
> 
> To apply the patch, cd into the git groff directory and
> "patch -p1 < path-to-patch-file", and then run make and install as usual.
> 
> I would be very interested in how you get on, and whether it gives you what 
> you need. Note that I am assuming you are feeding groff a file in UTF-8 and 
> the -k flag. I can see some hyphenation happening, but I don't know if it is 
> correct.
> 
> Cheers 
> 
> Deri

Hello Deri!

This patch works. All the outline titles are correct and .pdfinfo /Title,
/Author etc. also work with Cyrillic.
That's very cool.
But it only works when using UTF-8 as the input encoding (-Kutf-8).
As reported earlier in the corresponding Savannah ticket, even hyphenation
works with UTF-8 input and I see no difference to the hyphenation result
compared to KOI-8 input. I have no idea how you did this.
Still, when using UTF-8 input, there are problems (missing letters) with
link texts autogenerated by .pdfhref L.
With KOI-8 input, all the outlines are incomprehensible, i.e. they consist of
крокозябры as it would be called in Russian. ;-)
Apparently gropdf does not know that it has to convert from KOI-8 instead of
UTF-8.
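
A minimal sketch, in Python, of why the input encoding matters here.
It assumes only that a PDF text string for an outline is UTF-16BE
preceded by a byte-order mark; the sample title and the helper name are
made up, and nothing below is taken from gropdf's code.

```python
import codecs

title = "Введение"  # sample outline title, chosen for illustration


def pdf_text_string(text):
    """Encode text as UTF-16BE preceded by a byte-order mark."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


utf8_bytes = title.encode("utf-8")
koi8_bytes = title.encode("koi8_r")

# Decoding with the encoding the input really uses gives the right title.
good = pdf_text_string(utf8_bytes.decode("utf-8"))

# KOI8-R bytes are not valid UTF-8, so a converter that assumes UTF-8
# either fails outright ...
try:
    koi8_bytes.decode("utf-8")
    koi8_is_valid_utf8 = True
except UnicodeDecodeError:
    koi8_is_valid_utf8 = False

# ... or, if it treats each byte as a character, produces garbage.
garbled = koi8_bytes.decode("latin-1")

# Told the real encoding, the same bytes yield the same PDF string.
fixed = pdf_text_string(koi8_bytes.decode("koi8_r"))
```

The point is only that the converter has to be told which encoding the
bytes are in; it cannot reliably guess.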

So I am still going to disable the outlines for the time being and go with
KOI-8.
It's more of a nice-to-have than a necessity anyway.
I need Russian support as I am writing my master's thesis in Russian.
At the end of the day, this will be printed, so I can live without
PDF outlines.

Best regards,
Robin

PS: And to comment on some of the heated discussions on this list:
It's great that you and Branden spend so much time on improving Groff.
I think, you do a great job. Regressions are sometimes unavoidable,
especially when taking over a large code base from somebody else.



Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread Deri
On Tuesday, 6 February 2024 14:45:59 GMT G. Branden Robinson wrote:
> Hi Deri,
> 
> Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
> There were two things I knew were still unmerged: the slanted symbol
> (SS) font support and the stringhex business, which has larger
> consequences than I understood at first.
> 
> At 2024-02-06T13:39:51+, Deri wrote:
> > The current gropdf (in the master branch) does support UTF-16BE
> > for pdf outlines (see attached pdf), but Branden has not released
> 
> At this point it's a merge (to the master branch), not a release, but
> true with that caveat.
> 
> So let me take a crack at a code review.

Hi Branden,

Many thanks for your thoughts on my code. I shall reply in general terms since 
your grasp of some of the issues is rather hazy, as you admit.

Huge AGL lookup table

My least favourite solution, but you made me do it! The most elegant and 
efficient solution was to make a one line amendment to afmtodit which added an 
extra column to the groff font files which would have the UTF-16 code for that 
glyph. This would only affect devpdf and devps and I checked the library code 
groff uses to read its font files was not affected by an extra column. I also 
checked the buffer used would not cause an overflow. Despite this, you didn't 
like this solution, without giving a cogent reason,  but suggesting a lookup 
table!

As to whether I should embed the table, or read it in, I deferred to the more 
efficient method used by afmtodit, embed it as part of make. I still would 
prefer the extra column solution, then there is no lookup at all.

use_charnames_in_special

Probably unnecessary once you complete the work to return .device to its 
1.23.0 condition, as you have stated.

pdfmomclean

Not quite sure how your work on #64484 will affect this, we will have to wait 
and see.

Stringhex

Clearly you are still misunderstanding the issue, because there are some 
incorrect statements.

In any lookup there is a key/value pair. If dealing with a document written in 
Japanese, both the key and the value will arrive as unicode. No problem for 
the value, but the key will be invalid if used as part of a register name. 
There are two obvious solutions. One is to encode the key into something, 
easily decoded, which is acceptable to be used as part of a register name, or 
do a loop traversal over two arrays, one holding the keys and one the values. 
I'm pretty sure my 9yr old grandson would come up with a looping solution. I 
really don't understand your opposition to the encoding solution. OK, I accept 
you would have done it the child's way with the performance hit, but I prefer 
the more elegant encoding solution.
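
A minimal sketch of the encoding idea, in Python.  It assumes a hex
encoding of the key's UTF-16BE bytes; the exact output format of
stringhex is not given in this message, so the function names and the
format here are illustrative only.  A dictionary stands in for the
formatter's table of strings.

```python
def string_to_hex(key):
    """Encode a key as uppercase hex digits of its UTF-16BE bytes."""
    return key.encode("utf-16-be").hex().upper()


def hex_to_string(encoded):
    """Invert string_to_hex()."""
    return bytes.fromhex(encoded).decode("utf-16-be")


strings = {}  # stand-in for the formatter's string table


def define_lookup(key, value):
    # The encoded key contains only 0-9 and A-F, so it is safe to embed
    # in an identifier such as pdf:look(...).
    strings["pdf:look(" + string_to_hex(key) + ")"] = value


def lookup(key):
    # One table access, no loop over all stored keys.
    return strings.get("pdf:look(" + string_to_hex(key) + ")")


define_lookup("はじめに", "page 1")
define_lookup("ABC", "page 2")
```

Looking up "はじめに" then costs one encoding pass over the key plus a
single table access, and the original text can be recovered from the
identifier with hex_to_string().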

Uniqueness of keys is an issue for either strategy. In mom, a user supplied 
key name is only possible by using the NAMED parameter, and if a user uses the 
same name twice in the document nothing nasty will happen, the overview panel 
will be correct, since each of those is tagged with a safe generated name, and 
if they have used the same name for two different places in the document, when 
they are checking all the intra-document links they will find one of them will 
go to the wrong place. Of course this could be improved by warning when the 
same name is provided for a different destination. The man/mdoc macros 
currently have no named destinations, all generated, but this will change if 
the mdoc section referencing is implemented.
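
The warning suggested above could look roughly like this.  It is a
sketch in Python with made-up names, not mom's implementation, and it
assumes the first definition of a name is the one that is kept.

```python
destinations = {}  # user-supplied name -> location in the document
warnings = []


def name_destination(name, location):
    """Record a named destination; warn if the name already points elsewhere."""
    previous = destinations.get(name)
    if previous is not None and previous != location:
        warnings.append(
            f"destination '{name}' already refers to {previous}; "
            f"ignoring {location}"
        )
        return False
    destinations[name] = location
    return True


first = name_destination("exercises", "page 12")
repeat = name_destination("exercises", "page 12")   # same place: fine
clash = name_destination("exercises", "page 40")    # different place: warn
```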

You mention a possible issue if a diversion is passed to stringhex, since this 
is 95% your own code for stringup/down, I'm pretty sure that whatever you do 
to solve the issue in your own code can be equally applied to stringhex, so 
this is not an argument you can use to prevent its inclusion.

As regards your point 2, this is a non-issue, in 1.23.0 it works fine with 
.device. You ask what does:-

\X'pdf: bizzarecmd \[u1234]'

Mean? Well, assuming you are writing in the Ethiopic language and wrote:-

\X'pdf: bizzarecmd ሴ'

Then gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC 
SYLLABLE SEE), which could be setting a window title in the pdf viewer; I'm 
not sure, I have not written a handler for bizzarecmd. As you can see, this 
is not "misleading to a novice" at all: the fact that preconv changed it to a 
different form and gropdf changed it back to a character to use in pdf 
metadata is completely transparent to the user.
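
A minimal sketch of that round trip, in Python.  The two helpers are
hypothetical stand-ins for what preconv and gropdf do with a single
code point; composite escapes and everything else the real programs
handle are left out.

```python
import re


def to_escapes(text):
    """Replace each non-ASCII character with a \\[uXXXX] escape."""
    return "".join(
        ch if ord(ch) < 128 else f"\\[u{ord(ch):04X}]" for ch in text
    )


_ESCAPE = re.compile(r"\\\[u([0-9A-Fa-f]{4,6})\]")


def from_escapes(text):
    """Turn \\[uXXXX] escapes back into the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# "\u1234" is the character written in the example above.
typed = "\\X'pdf: bizzarecmd \u1234'"
sent = to_escapes(typed)        # the form the formatter sees
received = from_escapes(sent)   # the form used for PDF metadata
```

The user types the character and the metadata ends up with the same
character; the escape form exists only in between.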

Your work on \X and .device is to put .device back to how it was in 1.23.0 and 
alter \X to be the same, this is what you said would happen.

The purpose of my patch was intended to give Robin a robust solution to what 
he wanted to do.

You wrote in another email:-

"But tparm(const char *str, long, long, long, long, long, long, long,
long, long) is one of the worst things I've ever seen in C code.

As I just got done saying (more or less) to Deri, when you have to
obfuscate your inputs to cram them into the data structure you're using,
that's a sign that you're using 

[bug #60930] Integrate Peter Schaffter's font installer script into groff

2024-02-06 Thread Peter Schaffter
Follow-up Comment #13, bug#60930 (group groff):


[comment #12 comment #12:]
> [comment #11 comment #11:]
> > There's no copyright affixed to the script, but I can add one and
> > slap on a GPL notice if that would help.  Let me know.  There'll
> > probably have to be a copyright assignment to the Free Software
> > Foundation as well.  I don't keep up with the legalities, I'm afraid.
> 
> Nor I; someone more conversant with licensing will have to address this.  I
> only brought it up as part of a list of potential obstacles I could see to
> getting it into the package.

I've put a copyright notice (2024--I can't remember when I wrote the damned
thing) and GPL licence on the script and uploaded it to
https://www.schaffter.ca/mom/bin/install-font.sh

Whoever wants to can grab it from there.  I believe Bertrand, as groff
maintainer, has to initiate the copyright assignment process.


___

Reply to this item at:

  

___
Message sent via Savannah
https://savannah.gnu.org/




Re: .bp not working in groff 1.23.0 when it worked fine in 1.22.4

2024-02-06 Thread G. Branden Robinson
At 2024-02-05T13:11:04-0600, Dave Kemper wrote:
> On 2/5/24, G. Branden Robinson  wrote:
> > As far as I know, groff has never extended AT&T troff syntax in _this_
> > respect.
> >
> > The argument count to requests (unlike macros) is seemingly sacrosanct.
> 
> Groff extended the .ss request by adding an optional second parameter
> where AT&T's took only one.  It's not exactly the same situation, but
> would seem to cross the same minefield.

Fair!  I forgot about this.  Before posting, I scanned down the request
list in groff(7) to protect myself from embarrassment--uselessly.

Witness the power of Cunningham's Law.

I'm still not sure extending `ns` is the right idea, but my biggest
objection appears to have evaporated, unless we expect that way more
people use `ns` than `ss`,[1] and that we'll unleash havoc with this
extension where `ss` did not.

An alternative would be to have a new "alternate" no-space mode macro,
probably named "ns1" (another naming convention I really dislike, but am
hard pressed to improve upon).

I'm not sure which I like less.

Regards,
Branden

[1] I suspect more _macro packages_ do.  Does it matter?  I don't know.




Re: [PATCH v3] [grotty]: Use terminfo.

2024-02-06 Thread G. Branden Robinson
At 2024-02-05T23:57:08+, Lennart Jablonka wrote:
> Quoth G. Branden Robinson:
> > Fortunately, ncurses is attentively maintained.  So instead of
> > sucking up a whole bunch of oxygen there as I did with groff, I
> > found that it behooves me to actually read the X/Open Issue 7
> > standard for curses.
> > 
> > That is a 420-page document, so it's taking some time to absorb.
> 
> Since you wrote this, every now and then I’ve looked at the
> ncurses-bug archives, scrolling through some of your patches.  That’s
> a little fun.

I said something the other day about eating elephants.

The ncurses documentation is a _mastodon_.

> If you haven’t finished rewriting the ncurses manual yet,

I haven't.  That's a long way off.  However my work there has been on
pause for a couple of weeks.

> are you perchance done absorbing X/Open Curses?

No, but...

> You know, just in case you want to take another look at that patch I
> sent a while ago.  For grotty.  To use terminfo.

I do, and I think I'm adequately prepared to do so.  When perusing
the ncurses documentation I focused first on the low level stuff.
Forms, menus, and panels will come much later (after *curses proper),
and I don't remotely need any of that to understand how grotty needs to
talk to terminfo.

So I will take another look, yes.

But tparm(const char *str, long, long, long, long, long, long, long,
long, long) is one of the worst things I've ever seen in C code.

As I just got done saying (more or less) to Deri, when you have to
obfuscate your inputs to cram them into the data structure you're using,
that's a sign that you're using the wrong data structure.

And early C programmers sure did have a terror of passing structs on the
stack.  (They were _so_ scared of structs that as soon as they declared
one, they concealed it with a typedef, which isn't a type definition but
a type _alias_, and some day I will yell this at Dennis Ritchie's
grave.)

I don't know if this is because compiler support was bad or because
that's just how coders rolled in the late 1970s and early 1980s.  Who
needs a data structure when you can use a dozen global primitives?  For
all the stick C coders of that generation gave to poor bastards whose
first programming language was Microsoft BASIC (and who
consequently, it was held, were forever crippled in programming
ability), C sure did seem to carry quite a population of advocates whose
main appreciation of C's advantages over MS BASIC seemed to be that the
former didn't need line numbers.

:-|

goto fail;

Regards,
Branden




gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
There were two things I knew were still unmerged: the slanted symbol
(SS) font support and the stringhex business, which has larger
consequences than I understood at first.

At 2024-02-06T13:39:51+, Deri wrote:
> The current gropdf (in the master branch) does support UTF-16BE
> for pdf outlines (see attached pdf), but Branden has not released

At this point it's a merge (to the master branch), not a release, but
true with that caveat.

So let me take a crack at a code review.

diff --git a/contrib/mom/om.tmac b/contrib/mom/om.tmac
index d3b5002a8..87d9ba3cb 100644
--- a/contrib/mom/om.tmac
+++ b/contrib/mom/om.tmac
@@ -4906,7 +4906,7 @@ y\R'#DESCENDER \\n[.cdp]'
 .ds $AUTHOR \\*[$AUTHOR_1]
 .substring $AUTHORS 0 -2
 .ds PDF_AUTHORS \\*[$AUTHORS]
-.pdfmomclean PDF_AUTHORS
+.if '\\*[.T]'ps' .pdfmomclean PDF_AUTHORS
 .nop \!x X ps:exec [/Author (\\*[PDF_AUTHORS]) /DOCINFO pdfmark
 .END
 .
@@ -23512,13 +23512,13 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .  el .nr LEVEL_REQ \\n[CURRENT_LEVEL]
 .   \}
 .   ds PDF_TX \\$*
-.   pdfmomclean PDF_TX
 .   nr PDF_LEV (\\n[LEVEL_REQ]*\\n[#PDF_BOOKMARKS_OPEN])
 .   ie '\\*[.T]'ps' \{\
 .   if !'\\*[PDF_NM]'' \{\
 .  pdfhref M -N \\*[PDF_NM2] -- \\*[PDF_TX]
 .  if !dpdf:href.map .tm gropdf-info:href \\*[PDF_NM2] \\*[PDF_TX]
 .   \}
+.   pdfmomclean PDF_TX
 .   pdfbookmark \\n[PDF_LEV] \\*[PDF_TX]
 .   \}
 .   el .pdfbookmark \\*[PDF_NM] \\n[PDF_LEV] \\$*
@@ -23539,7 +23539,7 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 \#
 .MAC PDF_TITLE END
 .ds pdftitle \\$*
-.pdfmomclean pdftitle
+.if '\\*[.T]'ps' .pdfmomclean pdftitle
 .nop \!x X ps:exec [/Title (\\*[pdftitle]) /DOCINFO pdfmark
 .END
 \#

I hope to have made "pdfmomclean" unnecessary with my revised fix for
Savannah #64484.[1]  Or at least enabled it to be shorter and simpler.

@@ -23612,8 +23612,10 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .if '\\*[PDF_AST]'*' \{\
 .chop PDF_TXT
 .ie '\\*[.T]'pdf' \{\
-.   ie d pdf:look(\\*[PDF_NM]) \
-.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM])]\\*[PDF_AST_Q]
+.   ds PDF_NM_HEX \\*[PDF_NM]
+.   stringhex PDF_NM_HEX
+.   ie d pdf:look(\\*[PDF_NM_HEX]) \
+.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM_HEX])]\\*[PDF_AST_Q]
 .   el \{\
 .   as PDF_TXT Unknown
 .   if !rPDF_UNKNOWN .tm \

In our discussions, significant confusion (mostly mine, I guess) has
surrounded "stringhex".  There are two distinct problems, as I
understand it.

1.  To date in groff development, PDF bookmarks generally get named
using the text they're associated with.  In macro packages there is
a tendency to keep track of the bookmarks by defining strings for
them.  That's not a problem.  The problem is that the associated
text is made _part of the string identifier_ (probably because this
[a] is easy to implement and [b] results in O(1) lookup time given
the way the formatter manages *roff strings).  The trouble is that a
special character escape sequence is not valid in a *roff
identifier.

So while "pdf:look(ABC)" is a valid identifier,
"pdf:look(\[*a]\[*b]\[*c])" is not, and Unicode special character
escape sequences are no different.

In my opinion, it is not a good design to encode the bookmark text
directly into the name of the *roff identifier like this.

A.  We have the problem above.

B.  How do you ensure uniqueness of these strings?  What if I have
multiple places in a document titled, say "Exercises"?

An alternative approach would be to store the bookmark IDs in
strings indexed by a serial number.  A *roff autoincrementing
register is an obvious mechanism for doing this.  When you need to
look up the bookmark, you call a macro that searches the collected
IDs until a match is found.

Back-of-napkin sketch:

.\" Search for bookmark text matching $1.  Find matching bookmark
.\" number in pdf*found if found, otherwise -1 for failure.
.de pdf:look
.  nr pdf*found -1
.  nr pdf*search-index 0 1
.  while (\\n+[pdf*search-index] < \\n[pdf*max-index]) \{\
.    if '\\*[pdf*bookmark!\\n[pdf*search-index]]'\\$1' \{\
.      nr pdf*found \\n[pdf*search-index]
.      break
.    \}
.  \}
..

Yes, this is an O(n) search and yes, we still have the uniqueness
problem.

Still another approach is to hash the bookmark identifier in some
way, and that is more or less what you're doing with `stringhex`.
(Strictly, it's not a hash, but an encoding.)  This is back to O(1)
lookup time, which is good, but I regret the 

[bug #64155] specifying -fZD on command line generates warnings

2024-02-06 Thread G. Branden Robinson
Update of bug#64155 (group groff):

 Assigned to: gbranden => deri

___

Follow-up Comment #22:

[comment #21 comment #21:]
> Hi Branden,
> 
> I've worked it out. This is the problem, I had a corrupt copy of U-TR in my
> build directory

Ok.  That explains the different file size as noted in comment #20.

> so this code:-

>   if test -f $f; then \
> /usr/bin/install -c -m 644 $f /usr/share/groff/1.23.0/font/devpdf/$f; \
>   else \
> /usr/bin/install -c -m 644 ./font/devpdf/$f \
>   /usr/share/groff/1.23.0/font/devpdf/$f; \
>   fi; \


> Copied the corrupt file rather than the correct file in ./font/devpdf. What
> is the purpose of this?

I don't know.  My name is on the 2 lines of the else branch per "git blame"
but closer inspection reveals that all I did was break the long lines.  Before
that was a commit by Bertrand.

https://git.savannah.gnu.org/cgit/groff.git/commit/?id=b101574cae1b3019d4109d72b81e4c0a33bb5a86

and before that it was a "Makefile.sub" written by...

...you.  ;-)

> I am going to leave this open so that you can reply to the above

I'm going to plead noninvolvement on this one.  :P

> and whether you think this change will affect mom's use of .fam.

I don't _think_ so, given that mom has her own system of managing fonts, and
part of her contract with the user, as I understand it, is that user will not
go behind her back and start invoking *roff requests.

But I'll add Peter to the CC so he can opine.  I trust he'll know if I'm
confounding anything mom does.

> Also, I am a bit surprised that you must be reading U-TR (rather than just
> checking for its existence) and then when it fails (silently) it complains
> about the family, which is confusing, rather than reporting a font as
> corrupt.

I agree.  I'm not happy with that, either.  See the end of comment #15.






[bug #64155] specifying -fZD on command line generates warnings

2024-02-06 Thread Deri James
Update of bug#64155 (group groff):

 Assigned to: deri => gbranden

___

Follow-up Comment #21:

Hi Branden,

I've worked it out. This is the problem, I had a corrupt copy of U-TR in my
build directory so this code:-

  if test -f $f; then \
/usr/bin/install -c -m 644 $f /usr/share/groff/1.23.0/font/devpdf/$f; \
  else \
/usr/bin/install -c -m 644 ./font/devpdf/$f \
  /usr/share/groff/1.23.0/font/devpdf/$f; \
  fi; \

Copied the corrupt file rather than the correct file in ./font/devpdf. What is
the purpose of this?
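
A minimal sketch, in Python, of the selection logic in that shell
fragment: a file of the given name in the first directory wins, and the
shipped copy is used only as a fallback.  The directory layout and
helper name are hypothetical; the sketch only mirrors the "test -f"
branch, not the real install rule.

```python
import os
import tempfile


def pick_source(name, build_dir, shipped_dir):
    """Return the build-directory copy if it exists, else the shipped one."""
    candidate = os.path.join(build_dir, name)
    if os.path.isfile(candidate):
        return candidate
    return os.path.join(shipped_dir, name)


with tempfile.TemporaryDirectory() as top:
    build = os.path.join(top, "build")
    shipped = os.path.join(top, "font", "devpdf")
    os.makedirs(build)
    os.makedirs(shipped)

    with open(os.path.join(shipped, "U-TR"), "w") as fh:
        fh.write("good font description\n")

    # With no copy in the build directory, the shipped file is chosen.
    chosen_clean = pick_source("U-TR", build, shipped)
    picked_shipped_first = chosen_clean.startswith(shipped)

    # A stale or corrupt copy in the build directory silently takes over.
    with open(os.path.join(build, "U-TR"), "w") as fh:
        fh.write("corrupt\n")
    chosen_stale = pick_source("U-TR", build, shipped)
    picked_build_copy = chosen_stale.startswith(build)
```

Nothing in this logic checks the content of the file it finds first,
which is how a corrupt copy ends up installed.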

I am going to leave this open so that you can reply to the above and whether
you think this change will affect mom's use of .fam.

Also, I am a bit surprised that you must be reading U-TR (rather than just
checking for its existence) and then when it fails (silently) it complains
about the family, which is confusing, rather than reporting a font as
corrupt.



