Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Peter Schaffter
On Wed, Feb 07, 2024, Deri wrote:
> It is with a heavy heart that I announce I shall be leaving the groff mailing 
> list, I'm finding it too much work.

Deri --

This news just unmade my day.  The value of gropdf and your
contributions to mom have been immeasurable.  I will miss your voice
on the list.

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Dave Kemper
Deri,

I'm very sad to see your departure.  This list and the groff project
will sorely miss your knowledge, expertise, and coding skills.  I hope
you find a way to return soon.



Tears in my eyes, joy in my heart (was: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text))

2024-02-07 Thread Deri
Hi Branden,

It is with a heavy heart that I announce I shall be leaving the groff mailing 
list, I'm finding it too much work. Yesterday, you managed over 2900 words in 
approx 80 minutes whilst also doing a code review, and probably 3 other things 
as well! I am so jealous; my paltry sub-10 words a minute (on a good day) just 
can't cope, particularly as you sometimes have difficulty getting the points I 
am trying to make and reply with points which are not relevant. An example was 
your "unease" with adding an extra field to afmtodit output: you pointed me to 
some documentation rather than a swift perusal of the code in afmtodit, where 
you can see that it ALWAYS outputs 5 tab characters and never outputs "--", so no 
comments. And talking of "unease", you wrote, in reply to my request for help 
with merging: "Sure.  Once we're _both_ happy with it!  :D", and this was 
eight months ago, so unease really does mean rejection. In November it 
became:-

> > 1.  Changing the format of font description files to add yet
> > another field, mapping character names to Unicode code points.
> > In the rest of groff, this is not necessary because we have
> > glyphuni.cpp.
> >
> >
> > https://git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/glyphuni.cpp
> >
> > I'd like to honor the DRY principle here.  What's a good way
> > to achieve that?

Given that afmtodit does not use glyphuni.cpp (and can't), the DRY principle 
here means to let afmtodit plant the needed data in the font files for gropdf 
to use, but you didn't seem to see how irrelevant your comment was.

Anyway, enough of this useless banter. This is a joyful moment, I'm freeing up 
so much time to pursue other projects that will be as rewarding as 
writing gropdf has been, like:-

Detection of bias in UK news channels

In the UK there is a legal obligation for "Due impartiality and due accuracy" 
(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code/section-five-due-impartiality-accuracy).
For the past 4 years, I have 
been converting the dvb-t subtitles for news channels into text using an OCR 
program I wrote. It's about time I fooled around with the data using NLP and 
saw whether it is possible to detect bias within it. At a minimum I can 
extract statistics on the political persuasion of guests, but I've got a 
feeling I might be able to go further.

GB News, a right wing channel, keeps getting fined. I'd love to be able to 
write something which automatically emailed a complaint to Ofcom if it caught 
them breaking the rules, without having to watch the channel all day. :-)

My autobiography

Well, I've got the title - "A life more ordinary"!!

If I ever get the gropdf itch in the future, this is my todo list:-

A) Underlining text.

Peter asked if I could do this ages ago, because he has a method for 
PostScript, from Tadziu. It is half done.

B) Watermarking

Given a pdf, scale it to full page size and place it under the groff output; 
for a stamp, put it above it. I have worked out the last wrinkle. Normally, if you 
rotate the page with -P-l any pdfpic will be rotated as well, so that the 
picture orientation stays with the text orientation but the watermark 
orientation is controlled by the page orientation.

C) Ttf/otf in pdfs

This is a lot of work, but I was starting to get a handle on it. Incidentally, 
if I ever do get this done, the Tibet ligatures issue will be solved. The 
reason it seems to be OK everywhere else except in groff is that the "rules" 
for the ligature placement/resizing are in sub-tables within the ttf font 
file, but in the fontforge conversion to a pfa file most of this information 
is discarded, because Type 1 fonts have no concept of vertical adjustments. 
All that gets through is the horizontal adjustment, which ensures the glyphs 
print over each other, but without the correct vertical adjustment/sizing. 
Still a lot of research to do.

I've just seen your last email with a lot of nice things, but sometimes you 
confuse "code review" with "design review". If someone wants to know how to 
get to the doctor, it is not helpful to say "Well, I would not start from here". 
I have told you right from the beginning that all I needed was a way to pass 
anything to gropdf, and so I coded on the expectation I could receive anything 
and dealt with it appropriately. This is all working code. Later you expressed 
a preference for a method where you would clean the data within troff so I 
would not need to, but I already had working code and so far any alternative 
is vapourware, and the only pseudo code I've seen (a for loop with a flag to 
indicate whether the next item is a node or a character), with the expectation 
that nodes will be discarded, would not cut the mustard because I believe 
special characters (e.g. \[u] or \[em]) are actually held as nodes within 
troff so would be discarded as not a character. So the criticism is of my 
design, hardly what I call a code review and 

Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-07 Thread G. Branden Robinson
[self-follow-up]

Hi Deri,

One more thing occurred to me, because your last paragraph was sticking
in my mind and I think I figured out why.

At 2024-02-06T19:30:58-0600, G. Branden Robinson wrote:
> > I am quite sure there will be "bugs" in my code, it is fairly
> > complex, but subjecting it to a "code review" without even running
> > it to see if it does what it says on the box, is not helpful.
> 
> I think you've pretty badly mistaken my perspective.  One of the
> reasons I stick my long nose into your code in this way is because I
> don't worry that you won't produce correct results.  You have an
> established record of delivering solutions that work as advertised.

That you put code review into scare quotes gave me a sort of belated
pause.  It finally dawned on me that you might be regarding my
undertaking of such on your contributions as a form of insult.

It emphatically is not!

Some computer science luminary--unfortunately I cannot remember who at
the moment--made the observation that programming languages chiefly
exist so that human beings can communicate to each other about
programming.  (Maybe someone reading recollects who I mean.)  If PLs
were intended _solely_ for consumption by machines, we'd stick with
machine language...or maybe assembly.

At the places I have worked, and at sites like GitHub and GitLab where
people manage things like pull requests and merge requests, it is not
only common for people other than the code author to undertake code
review before attempting to run it themselves, it is expected that they
won't!

Part of this is due to the cultural expectation that the author of code
will have tested it.  But another aspect is that humans are actually
pretty bad at inferring (perfect) correctness from inspection of source
code.  We are indeed likely to assume that it does what it says on the box.
What code review is good for--and I think I said this recently on this
list, but maybe it was someplace else--is for programmers to share
expertise and problem-solving techniques with each other, and also to
reinforce the team mentality that sustains successful software projects
above the very small scale.

So I would ask that you please try to adopt that perspective when a
person perceptibly studies your, or anyone else's code.  Not all code
is worthy of study.  The famous Lions book presenting the Sixth Edition
Unix kernel was not an insult to Thompson and Ritchie, but a high form
of flattery...and today that book stands as a monument in the field of
operating systems research as an exposition of a successful,
high-quality system.

At the same time, everybody had gripes about the Unix kernel and some
aspects of how it was written, and even designed.  This is how we learn,
individually and collectively.

So, if I pay your code some scrutiny, it is not out of hauteur, but
respect.  I look at your code because I want to work with you.

I appreciate what you've contributed to groff and am pleased by how
well-received your efforts continue to be.

Best regards,
Branden




Re: PDF outline not capturing Cyrillic text

2024-02-07 Thread Deri
On Wednesday, 7 February 2024 01:07:37 GMT Robin Haberkorn wrote:
> Still, when using UTF-8 input, there are problems (missing letters) with
> link texts autogenerated by .pdfhref L.

[...]

> 
> Best regards,
> Robin
> 
> PS: And to comment on some of the heated discussions on this list:
> It's great that you and Branden spend so much time on improving Groff.
> I think you do a great job. Regressions are sometimes unavoidable,
> especially when taking over a large code base from somebody else.

Hi Robin,

Many thanks for the kind words, although there will be some sad news later. :-(

I wonder if you could send me a small example of .pdfhref L missing letters 
and the command you are using. I don't need the whole thesis; I would not 
understand it.

Cheers 

Deri






Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

At 2024-02-06T21:35:05+, Deri wrote:
> Many thanks for your thoughts on my code. I shall reply in general
> terms since your grasp of some of the issues is rather hazy, as you
> admit.

I generally don't feel I grasp code of nontrivial complexity until I've
documented it and written tests for it, and often not even then.  I'm a
bear of very little brain!

> Huge AGL lookup table
> 
> My least favourite solution, but you made me do it! The most elegant
> and efficient solution was to make a one line amendment to afmtodit
> which added an extra column to the groff font files which would have
> the UTF-16 code for that glyph. This would only affect devpdf and
> devps and I checked the library code groff uses to read its font files
> was not affected by an extra column. I also checked the buffer used
> would not cause an overflow. Despite this, you didn't like this
> solution, without giving a cogent reason, instead suggesting a lookup
> table!

I remember expressing unease with the "new column" approach, not
rejection.  The main reason is the documented format of the lines in
question.

groff_font(5):
 The directive charset starts the character set subsection.  (On
 typesetters, this directive is misnamed since it starts a list of
 glyphs, not characters.)  It precedes a series of glyph
 descriptions, one per line.  Each such glyph description comprises
 a set of fields separated by spaces or tabs and organized as
 follows.

name metrics type code [entity‐name] [-- comment]

[...]

 The entity‐name field defines an identifier for the glyph that the
 postprocessor uses to print the troff glyph name.  This field is
 optional; it was introduced so that the grohtml output driver could
 encode its character set.  For example, the glyph \[Po] is
 represented by “&pound;” in HTML 4.0.  For efficiency, these data
 are now compiled directly into grohtml.  grops uses the field to
 build sub‐encoding arrays for PostScript fonts containing more than
 256 glyphs.  Anything on the line after the entity‐name field or
 “--” is ignored.

The presence of 2 adjacent optional fields seems to me fairly close to
making the glyph descriptions formally undecidable.  In practice,
they're not, until and unless someone decides to name their "entity"
"--"...  (We don't actually tell anyone they're not allowed to do that.)
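
To make the ambiguity concrete, here is a sketch (in Python, with made-up
metrics and code values) of how a postprocessor might parse such a glyph
description line; the hypothetical `parse_glyph_line` shows exactly where an
entity literally named "--" would be misread as the comment introducer:

```python
# Hypothetical parser for groff_font(5) glyph description lines of the form
#   name metrics type code [entity-name] [-- comment]
# The two adjacent optional fields are only decidable if no entity-name is
# ever literally "--".

def parse_glyph_line(line):
    fields = line.split()
    name, metrics, gtype, code = fields[:4]
    rest = fields[4:]
    entity = None
    comment = None
    if rest:
        if rest[0] == "--":
            # Treat "--" as the comment introducer: no entity-name present.
            comment = " ".join(rest[1:])
        else:
            # Otherwise assume the fifth field is the entity-name.
            entity = rest[0]
            if rest[1:2] == ["--"]:
                comment = " ".join(rest[2:])
    return {"name": name, "metrics": metrics, "type": gtype,
            "code": code, "entity": entity, "comment": comment}

# Unambiguous: entity-name present, then a comment (field values invented).
print(parse_glyph_line("Po 500 2 163 pound -- pound sterling"))
# The degenerate case: an entity literally named "--" is indistinguishable
# from the comment introducer, so it parses as a comment.
print(parse_glyph_line("Po 500 2 163 -- pound sterling"))
```

Note the parser must commit to a policy for "--" in field five; groff's
documentation never forbids the degenerate case, which is the point being made.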

As I understand it, this feature is largely a consequence of the
implementation of grohtml 20-25 years ago, where an "entity" in HTML 4
and XHTML 1 was a well-defined thing.  We might do well to tighten the
semantics and format of this optional fifth field a bit more.

More esteemed *roffers than I have stumbled over our documentation's
unfortunate tendency to sometimes toss the term "entity" around,
unmoored from any formal definition in the *roff language.

https://lists.gnu.org/archive/html/groff/2023-04/msg2.html

While I'm complaining about hazy terminology that exacerbates my hazy
understanding of things, I'll observe that I don't understand what the
verb "to sub-encode" means.  I suspect there are better words to express
what this is trying to get across.  If I understood what grops was
actually doing here, I'd try to find those words.

> As to whether I should embed the table, or read it in, I deferred to
> the more efficient method used by afmtodit: embed it as part of make. I
> still would prefer the extra column solution, then there is no lookup
> at all.

I don't object to the idea, but I think our design decisions should be
documented, and it frequently seems to fall to me to undertake the
documentation.  That means I have to ask a lot of questions, which
programmers sometimes interpret as critique.  (And, to be fair,
sometimes I actually _do_ have critiques of an implementation.)

> use_charnames_in_special
> 
> Probably unnecessary once you complete the work to return .device to
> its 1.23.0 condition, as you have stated.

That seems like a fair prediction.  Almost all of the logic _on the
formatter side_ that employs this parameter seems to be in one function,
`encode_char()`.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427

(Last month, I renamed that to `encode_char_for_troff_output()` and I'm
thinking it can be further improved, more like
`encode_char_for_device_control()`...

...there's just one more thing.

There's one other occurrence, in a constructor.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293

I look forward to someday understanding what that's doing there.)

> pdfmomclean
> 
> Not quite sure how your work on #64484 will affect this, we will have
> to wait and see.

Fair enough.

> Stringhex
> 
> Clearly you are still misunderstanding the issue, because there are
> some incorrect statements.

Okay.

> In any lookup there is a key/value pair.

I'm with ya so far.

> If dealing with a document written in Japanese, both the key and 

Re: Re: PDF outline not capturing Cyrillic text

2024-02-06 Thread Robin Haberkorn
On Tue, Feb 06, 2024 at 01:39:51PM +, Deri wrote:
> Hi Robin,
> 
> The current gropdf (in the master branch) does support UTF-16BE for pdf 
> outlines (see attached pdf), but Branden has not released the other parts to 
> make it work! If you can compile and install the current git, then applying the 
> attached patch should give you what you want.
> 
> To apply the patch, cd into the git groff directory and "patch -p1 < path-to-
> patch-file", and then run make and install as usual.
> 
> I would be very interested in how you get on, and whether it gives you what 
> you need. Note that I am assuming you are feeding groff a file in UTF-8 and 
> the -k flag. I can see some hyphenation happening, but I don't know if it is 
> correct.
> 
> Cheers 
> 
> Deri

Hello Deri!

This patch works. All the outline titles are correct and .pdfinfo /Title,
/Author etc. also work with Cyrillic.
That's very cool.
But it only works when using UTF-8 as the input encoding (-Kutf-8).
As reported earlier in the corresponding Savannah ticket, even hyphenation
works with UTF-8 input, and I see no difference in the hyphenation result
compared to KOI-8 input. I have no idea how you did this.
Still, when using UTF-8 input, there are problems (missing letters) with
link texts autogenerated by .pdfhref L.
With KOI-8 input, all the outlines are incomprehensible, i.e. they consist of
крокозябры, as it would be called in Russian. ;-)
Apparently gropdf does not know it has to convert from KOI-8 instead of UTF-8.

So I am still going to disable the outlines for the time being and go with
KOI-8.
It's more of a nice-to-have than a necessity anyway.
I need Russian support as I am writing my master's thesis in Russian.
At the end of the day, this will be printed, so I can live without
PDF outlines.

Best regards,
Robin

PS: And to comment on some of the heated discussions on this list:
It's great that you and Branden spend so much time on improving Groff.
I think you do a great job. Regressions are sometimes unavoidable,
especially when taking over a large code base from somebody else.



Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread Deri
On Tuesday, 6 February 2024 14:45:59 GMT G. Branden Robinson wrote:
> Hi Deri,
> 
> Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
> There were two things I knew were still unmerged: the slanted symbol
> (SS) font support and the stringhex business, which has larger
> consequences than I understood at first.
> 
> At 2024-02-06T13:39:51+, Deri wrote:
> > The current gropdf (in the master branch) does support UTF-16BE
> > for pdf outlines (see attached pdf), but Branden has not released
> 
> At this point it's a merge (to the master branch), not a release, but
> true with that caveat.
> 
> So let me take a crack at a code review.

Hi Branden,

Many thanks for your thoughts on my code. I shall reply in general terms since 
your grasp of some of the issues is rather hazy, as you admit.

Huge AGL lookup table

My least favourite solution, but you made me do it! The most elegant and 
efficient solution was to make a one line amendment to afmtodit which added an 
extra column to the groff font files which would have the UTF-16 code for that 
glyph. This would only affect devpdf and devps and I checked the library code 
groff uses to read its font files was not affected by an extra column. I also 
checked the buffer used would not cause an overflow. Despite this, you didn't 
like this solution, without giving a cogent reason, instead suggesting a lookup 
table!

As to whether I should embed the table, or read it in, I deferred to the more 
efficient method used by afmtodit: embed it as part of make. I still would 
prefer the extra column solution; then there is no lookup at all.

use_charnames_in_special

Probably unnecessary once you complete the work to return .device to its 
1.23.0 condition, as you have stated.

pdfmomclean

Not quite sure how your work on #64484 will affect this, we will have to wait 
and see.

Stringhex

Clearly you are still misunderstanding the issue, because there are some 
incorrect statements.

In any lookup there is a key/value pair. If dealing with a document written in 
Japanese, both the key and the value will arrive as unicode. No problem for 
the value, but the key will be invalid if used as part of a register name. 
There are two obvious solutions. One is to encode the key into something 
easily decoded which is acceptable for use as part of a register name; the 
other is to do a loop traversal over two arrays, one holding the keys and one 
the values. I'm pretty sure my 9yr old grandson would come up with a looping 
solution. I really don't understand your opposition to the encoding solution. 
OK, I accept you would have done it the child's way, with the performance hit, 
but I prefer the more elegant encoding solution.
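
The actual stringhex lives inside troff; purely to illustrate the encoding
idea (the function names and the sample key below are my own, not groff's), a
Python sketch of turning an arbitrary Unicode key into hex digits, which are
always safe inside a *roff identifier, and reversing it on demand:

```python
# Illustration of the encoding idea behind stringhex (not the actual troff
# implementation): hex digits are always legal in a *roff identifier, so an
# arbitrary Unicode key can be embedded in a register/string name losslessly.

def stringhex_encode(key: str) -> str:
    # UTF-8 bytes rendered as uppercase hex: identifier-safe, easily decoded.
    return key.encode("utf-8").hex().upper()

def stringhex_decode(hexkey: str) -> str:
    return bytes.fromhex(hexkey).decode("utf-8")

# A Cyrillic bookmark name (sample key, chosen for illustration).
key = "Упражнения"
safe = stringhex_encode(key)

# The encoded key can be used directly in a name like pdf:look(...),
# giving O(1) lookup without a parallel-array linear search.
bookmarks = {"pdf:look(" + safe + ")": "destination-1"}
print("pdf:look(" + safe + ")")

# The original key is recoverable whenever it is needed again.
assert stringhex_decode(safe) == key
```

The contrast with the two-array alternative is the lookup cost: a dict (or a
*roff string name) keyed on the encoded form is a single probe, where the
parallel arrays need a loop over every stored key.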

Uniqueness of keys is an issue for either strategy. In mom, a user-supplied 
key name is only possible by using the NAMED parameter, and if a user uses the 
same name twice in the document nothing nasty will happen: the overview panel 
will be correct, since each of its entries is tagged with a safe generated 
name, and if they have used the same name for two different places in the 
document, they will find, when checking all the intra-document links, that one 
of them goes to the wrong place. Of course this could be improved by warning 
when the same name is provided for a different destination. The man/mdoc 
macros currently have no named destinations, all generated, but this will 
change if the mdoc section referencing is implemented.

You mention a possible issue if a diversion is passed to stringhex, since this 
is 95% your own code for stringup/down, I'm pretty sure that whatever you do 
to solve the issue in your own code can be equally applied to stringhex, so 
this is not an argument you can use to prevent its inclusion.

As regards your point 2, this is a non-issue; in 1.23.0 it works fine with 
.device. You ask what does:-

\X'pdf: bizzarecmd \[u1234]'

Mean? Well, assuming you are writing in the Ethiopic language and wrote:-

\X'pdf: bizzarecmd ሴ'

Then gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC 
SYLLABLE SEE), which could be setting a window title in the pdf viewer; I'm 
not sure, since I have not written a handler for bizzarecmd. As you can see, 
this is not "misleading to a novice" at all: the fact that preconv changed it 
to a different form and gropdf changed it back to a character to use in pdf 
metadata is completely transparent to the user.

Your work on \X and .device is to put .device back to how it was in 1.23.0 and 
alter \X to be the same; this is what you said would happen.

My patch was intended to give Robin a robust solution to what he wanted to 
do.

You wrote in another email:-

"But tparm(const char *str, long, long, long, long, long, long, long,
long, long) is one of the worst things I've ever seen in C code.

As I just got done saying (more or less) to Deri, when you have to
obfuscate your inputs to cram them into the data structure you're using,
that's a sign that you're using 

gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

2024-02-06 Thread G. Branden Robinson
Hi Deri,

Now _does_ seem to be a good time to catch up on gropdf-ng merge status.
There were two things I knew were still unmerged: the slanted symbol
(SS) font support and the stringhex business, which has larger
consequences than I understood at first.

At 2024-02-06T13:39:51+, Deri wrote:
> The current gropdf (in the master branch) does support UTF-16BE
> for pdf outlines (see attached pdf), but Branden has not released

At this point it's a merge (to the master branch), not a release, but
true with that caveat.

So let me take a crack at a code review.

diff --git a/contrib/mom/om.tmac b/contrib/mom/om.tmac
index d3b5002a8..87d9ba3cb 100644
--- a/contrib/mom/om.tmac
+++ b/contrib/mom/om.tmac
@@ -4906,7 +4906,7 @@ y\R'#DESCENDER \\n[.cdp]'
 .ds $AUTHOR \\*[$AUTHOR_1]
 .substring $AUTHORS 0 -2
 .ds PDF_AUTHORS \\*[$AUTHORS]
-.pdfmomclean PDF_AUTHORS
+.if '\\*[.T]'ps' .pdfmomclean PDF_AUTHORS
 .nop \!x X ps:exec [/Author (\\*[PDF_AUTHORS]) /DOCINFO pdfmark
 .END
 .
@@ -23512,13 +23512,13 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .  el .nr LEVEL_REQ \\n[CURRENT_LEVEL]
 .   \}
 .   ds PDF_TX \\$*
-.   pdfmomclean PDF_TX
 .   nr PDF_LEV (\\n[LEVEL_REQ]*\\n[#PDF_BOOKMARKS_OPEN])
 .   ie '\\*[.T]'ps' \{\
 .   if !'\\*[PDF_NM]'' \{\
 .  pdfhref M -N \\*[PDF_NM2] -- \\*[PDF_TX]
 .  if !dpdf:href.map .tm gropdf-info:href \\*[PDF_NM2] \\*[PDF_TX]
 .   \}
+.   pdfmomclean PDF_TX
 .   pdfbookmark \\n[PDF_LEV] \\*[PDF_TX]
 .   \}
 .   el .pdfbookmark \\*[PDF_NM] \\n[PDF_LEV] \\$*
@@ -23539,7 +23539,7 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 \#
 .MAC PDF_TITLE END
 .ds pdftitle \\$*
-.pdfmomclean pdftitle
+.if '\\*[.T]'ps' .pdfmomclean pdftitle
 .nop \!x X ps:exec [/Title (\\*[pdftitle]) /DOCINFO pdfmark
 .END
 \#

I hope to make "pdfmomclean" unnecessary with my revised fix for
Savannah #64484.[1]  Or at least to enable it to be shorter and simpler.

@@ -23612,8 +23612,10 @@ No room to start \\*[MN-pos] margin note #\\n[MN-curr] on page \\n[#P].
 .if '\\*[PDF_AST]'*' \{\
 .chop PDF_TXT
 .ie '\\*[.T]'pdf' \{\
-.   ie d pdf:look(\\*[PDF_NM]) \
-.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM])]\\*[PDF_AST_Q]
+.   ds PDF_NM_HEX \\*[PDF_NM]
+.   stringhex PDF_NM_HEX
+.   ie d pdf:look(\\*[PDF_NM_HEX]) \
+.   as PDF_TXT \&\\*[PDF_AST_Q]\\*[pdf:look(\\*[PDF_NM_HEX])]\\*[PDF_AST_Q]
 .   el \{\
 .   as PDF_TXT Unknown
 .   if !rPDF_UNKNOWN .tm \

In our discussions, significant confusion (mostly mine, I guess) has
surrounded "stringhex".  There are two distinct problems, as I
understand it.

1.  To date in groff development, PDF bookmarks generally get named
using the text they're associated with.  In macro packages there is
a tendency to keep track of the bookmarks by defining strings for
them.  That's not a problem.  The problem is that the associated
text is made _part of the string identifier_ (probably because this
[a] is easy to implement and [b] results in O(1) lookup time given
the way the formatter manages *roff strings).  The trouble is that a
special character escape sequence is not valid in a *roff
identifier.

So while "pdf:look(ABC)" is a valid identifier,
"pdf:look(\[*a]\[*b]\[*c])" is not, and Unicode special character
escape sequences are no different.

In my opinion, it is not a good design to encode the bookmark text
directly into the name of the *roff identifier like this.

A.  We have the problem above.

B.  How do you ensure uniqueness of these strings?  What if I have
multiple places in a document titled, say "Exercises"?

An alternative approach would be to store the bookmark IDs in
strings indexed by a serial number.  A *roff autoincrementing
register is an obvious mechanism for doing this.  When you need to
look up the bookmark, you call a macro that searches the collected
IDs until a match is found.

Back-of-napkin sketch:

.\" Search for bookmark text matching $1.  Find matching bookmark
.\" number in pdf*found if found, otherwise -1 for failure.
.de pdf:look
.  nr pdf*found -1
.  nr pdf*search-index 0 1
.  while (\\n+[pdf*search-index] < \\n[pdf*max-index]) \{\
.    if '\\*[pdf*bookmark!\\n[pdf*search-index]]'\\$1' \{\
.      nr pdf*found \\n[pdf*search-index]
.      break
.    \}
.  \}
..

Yes, this is an O(n) search and yes, we still have the uniqueness
problem.

Still another approach is to hash the bookmark identifier in some
way, and that is more or less what you're doing with `stringhex`.
(Strictly, it's not a hash, but an encoding.)  This is back to O(1)
lookup time, which is good, but I regret the 

Re: PDF outline not capturing Cyrillic text

2024-02-03 Thread Robin Haberkorn
Regarding cyrillic characters in PDF outlines, I think I got a few
insights today.

It turns out that the pdfmarks in the postscript code are "text strings"
according to the PDF specs, that is either a PDFDocEncoding or
UTF-16BE with a leading byte-order marker (cf. PDF Reference 1.7).
A PDFDocEncoding is basically latin1 it seems.
This explains why the current code in MOM works with western European
languages.
Now, in order to include cyrillic, you will have to reencode whatever
encoding Groff uses and passes to the postprocessor - which will
subsequently end up in the postscript code - to UTF-16BE.
Everything needs to be hex-encoded and enclosed in angle brackets (<...>).
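
The re-encoding described above can be sketched in a few lines of Python (the
sample string is my own; the byte layout follows the PDF Reference's text
string type: a UTF-16BE byte-order mark followed by the UTF-16BE payload,
hex-encoded between angle brackets):

```python
# Sketch of the conversion a pdfmark text string needs for non-Latin text:
# BOM + UTF-16BE, written as a PDF hex string.

def pdf_hex_text_string(s: str) -> str:
    # 0xFEFF byte-order mark identifies the string as UTF-16BE rather than
    # PDFDocEncoding (which is roughly latin1).
    data = b"\xfe\xff" + s.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"

# A Cyrillic sample; each character becomes one big-endian 16-bit unit.
print(pdf_hex_text_string("Тест"))  # <FEFF0422043504410442>
```

A post-processing script along these lines could rewrite the `/Title (...)`
operands in the `pdfroff --emit-ps` output, which is exactly the hack the
SciTECO one-liner below attempts.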

In the most hacky case, this could be done by a script on the
postscript code generated by `pdfroff --emit-ps`. As a proof of concept,
here's an incomplete, but somewhat working version in SciTECO:

sciteco -e "16,0ED @EB/document.ps/ <@S|/Title (|; -D @I|/ D> 
@EW//"

This assumes that the Groff encoding is KOI8-R, which I chose as an
intermediate format in order to enable Russian hyphenation
(but that does not work unfortunately).
It should be rewritten into a Python or Perl script using some
iconv wrapper or ideally pdfroff itself could do it.
The script could even interpret Groff Unicode escapes generated by preconv
and convert them back to plain Unicode before writing out everything in UTF16.

I will probably just use such a hack for my purposes.

What's the status of pdfroff anyway? I read that it is more or less
deprecated and we should all use `groff -Tpdf` instead.
Actually, pdfmom should work with ms as well; it uses
gropdf and should perform the necessary multipass processing
for pdfhref forward-references to work.
Will try this next!

Best regards,
Robin



Re: PDF outline not capturing Cyrillic text

2023-08-12 Thread Deri
On Saturday, 12 August 2023 07:35:20 BST G. Branden Robinson wrote:
> Hi Deri,
> 
> At 2023-06-23T22:40:42+0100, Deri wrote:
> > On Friday, 23 June 2023 19:17:58 BST Robin Haberkorn wrote:
> > > So it seems that the main problem really lies in grops and/or gropdf
> > > which should ideally work with the Unicode escapes produced by
> > > preconv.  I am not sure if we would still need .pdfmomclean. But
> > > whatever useful stuff it currently does, it should probably be in
> > > pdfmark.tmac (and/or pdf.tmac?) instead.
> 
> [...]
> 
> > The features you require are coming. This is an example of Russian
> > with bookmarks in cyrillic. I'm afraid I don't know what it means and
> > I have forgotten where I got the text.
> 
> [attachment stripped]
> 
> That's my "makhnovshchina.groff" demo from back in March.
> 
> https://lists.gnu.org/archive/html/groff/2023-03/msg00100.html
> 
> Regards,
> Branden

Hi Branden,

That's it. And here's the result using the deri-gropdf-ng branch.

Cheers 

Deri




maknovshchina.pdf
Description: Adobe PDF document


Re: PDF outline not capturing Cyrillic text

2023-08-12 Thread G. Branden Robinson
Hi Deri,

At 2023-06-23T22:40:42+0100, Deri wrote:
> On Friday, 23 June 2023 19:17:58 BST Robin Haberkorn wrote:
> > So it seems that the main problem really lies in grops and/or gropdf
> > which should ideally work with the Unicode escapes produced by
> > preconv.  I am not sure if we would still need .pdfmomclean. But
> > whatever useful stuff it currently does, it should probably be in
> > pdfmark.tmac (and/or pdf.tmac?) instead.
[...]
> The features you require are coming. This is an example of Russian
> with bookmarks in cyrillic. I'm afraid I don't know what it means and
> I have forgotten where I got the text.

[attachment stripped]

That's my "makhnovshchina.groff" demo from back in March.

https://lists.gnu.org/archive/html/groff/2023-03/msg00100.html

Regards,
Branden




Re: PDF outline not capturing Cyrillic text

2023-06-23 Thread Robin Haberkorn
Hello Peter,

I am also now stumbling across Cyrillic-related issues with pdfmark. I am using
ms for the time being. The bug also affects autogenerating link texts given via
`.pdfhref L`.
In the simplest case, preconv will turn your Cyrillic characters into escapes
which are apparently not further interpreted by pdfmark (or anything that 
follows).
I see text like "[u0421][u043F]..." in my outline.
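
The transformation preconv performs can be mimicked in a few lines
(illustrative only, not preconv's actual code; preconv also handles input
encodings and characters beyond the BMP, which this sketch ignores):

```python
# Mimic of preconv's basic behavior: non-ASCII characters in the input become
# groff special character escapes of the form \[uXXXX].

def to_groff_escapes(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # ASCII passes through
        else:
            out.append("\\[u%04X]" % ord(ch))   # e.g. С -> \[u0421]
    return "".join(out)

print(to_groff_escapes("Спасибо"))
```

With С (U+0421) and п (U+043F) leading, this matches the
"[u0421][u043F]..." text visible in the outline once the backslashes have
been consumed by the unaware pdfmark path.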

I believe that this is why you have .pdfmomclean in MOM. Do I understand
correctly that this is supposed to turn the escapes back into Latin-1?
This is presumably mainly the work of .asciify, which would be misnamed anyway.
It does not work with Cyrillic at all, which is not surprising.
That's also why you don't get "mojibake garbage" in the outline. None of the
Cyrillic characters end up in intermediate output.

It also explains why I previously had no problems with German Unicode characters
(that was using MOM) - they can be converted back into Latin-1.

Manually editing the ps:exec lines in the intermediate output and inserting
Unicode characters there does not produce the desired results, which is also
not surprising.

So it seems that the main problem really lies in grops and/or gropdf which
should ideally work with the Unicode escapes produced by preconv.
I am not sure if we would still need .pdfmomclean. But whatever useful stuff it
currently does, it should probably be in pdfmark.tmac (and/or pdf.tmac?) 
instead.

Best regards,
Robin



Re: PDF outline not capturing Cyrillic text

2022-09-22 Thread G. Branden Robinson
[self-reply]
At 2022-09-18T09:37:32-0500, G. Branden Robinson wrote:
> GNU troff doesn't, as far as I can tell, ever write anything but the
> 94 graphical code points in ASCII, spaces, and newlines to its output.

I left out a sentence here; it should be clear from the rest of the
message but it might be wise to be explicit.

"Except in device control escape sequences."

Regards,
Branden




Re: PDF outline not capturing Cyrillic text

2022-09-20 Thread Ralph Corderoy
Hi Branden,

> A shorter pole might be to establish a protocol for communication of
> Unicode code points within device control commands.  Portability isn't
> much of an issue here: as far as I know there has been no effort to
> achieve interoperation of device control escape sequences among
> troffs.
>
> That convention even _could_ be UTF-8, but my initial instinct is
> _not_ to go that way.  I like the 7-bit cleanliness of GNU troff
> output, and when I've mused about solving The Big Unicode Problem
> I have given strong consideration to preserving it, or enabling
> tricked-out UTF-8 "grout" only via an option for the kids who really
> like to watch their chrome rims spin.

Adding an option seems more needless complexity.
I am not a kid and have never had chrome rims.

> I realize that Heirloom and neatroff can both boast of this

I expect they just think it mundane.

> but how many people _really_ look at device-independent troff output?
> A few curious people, and the poor saps who are stuck developing and
> debugging the implementations, like me.  For the latter community,
> a modest and well-behaved format saves a lot of time.

I read it, diff(1) it, etc.  Skipping the device-specific rendering
often simplifies the comparison and removes another layer of potential
mud and error.

There's nothing great about the device-independent format being ASCII.
I strongly suggest using UTF-8 encoding for the Unicode runes that need
passing through to the device driver.  This will continue to make it
easy to read, grep, etc., and avoid yet another encoding format because
none of the existing ones are ‘good enough’.  The device drivers will
probably have UTF-8 parsing code to hand.

If groff ever reaches ‘UTF-8 everywhere’, an ad-hoc encoding for this
one thing will appear to be an anachronism when it is really a poor
recent decision.

-- 
Cheers, Ralph.



Re: PDF outline not capturing Cyrillic text

2022-09-18 Thread Oliver Corff

Dear Peter, Dear All,

this problem is presumably not limited to groff. I remember the same
issue when I was building LaTeX texts with foreign-language elements (in
my case, among others: Chinese) using the package that creates internal
links from the table of contents to chapters, etc. (hyperref, IIRC). As
soon as there were non-ASCII characters in the anchor, the result was
(halfway working) garbage.

My memory of this issue should only serve as a point of reference, I
think modern versions of hyperref can produce non-ASCII links and output.

Best regards,

Oliver.


On 17/09/2022 23:35, Peter Schaffter wrote:

> Greetings, all.
>
> Source documents written in Cyrillic and processed with mom/pdfmom
> break the PDF outline:
>
> 1. The text of titles and headings is not displayed in the outline.
>
> 2. If there are items where English is automatically part of the
>    outline label (cover, title page), the label prints (minus the
>    text), but the linking hierarchy fails with some viewers if there
>    are intervening titles or headings, e.g.
>      doc cover (Cyrillic)
>      copyright notice (in English)
>      table of contents (auto relocated)
>      title page (Cyrillic)
>      chapter title (Cyrillic)
>      body text with Cyrillic headings
>    With the above arrangement, the outline in okular and evince shows
>    only
>      Cover: (no text)
>      Copyright:
>      Title Page: (no text)
>    In okular, clicking on Copyright takes you to the correct page,
>    however in evince, clicking on Copyright takes you to the table
>    of contents page.
>
> At a guess, it looks as if gropdf or pdfmark isn't recognizing Cyrillic
> characters as valid input for creating pdf bookmarks.  I'm at a
> loss as to how to overcome this.  Ideas?





Re: PDF outline not capturing Cyrillic text

2022-09-18 Thread G. Branden Robinson
Hi Peter,

At 2022-09-17T17:35:02-0400, Peter Schaffter wrote:
> Source documents written in Cyrillic and processed with mom/pdfmom
> break the PDF outline:
> 
> 1. The text of titles and headings is not displayed in the outline.
[...]
> At a guess, it looks as if gropdf or pdfmark isn't recognizing
> Cyrillic characters as valid input for creating pdf bookmarks.  I'm at
> a loss as to how to overcome this.  Ideas?

I have a hunch that this is our old friend "can't output node in
transparent throughput".

Five years into my tenure as a groff developer, I think I finally
understand what this inscrutable error message is on about.  However, I
recently disabled these diagnostics by default in groff Git.

Try regenerating the document with

  GROFF_ENABLE_TRANSPARENCY_WARNINGS=1

(actually, you can set the variable to anything) in your environment.

The problem, I think, is that PDF bookmark generation, like some other
advanced features, relies upon use of device control escape sequences.
That means '\X' stuff.

In device-independent output ("grout", as I term it), these become "x X"
commands, and the arguments to the escape sequence are, you'd think,
passed through as-is.

The trouble comes with the assumption people make about what "as-is"
means.

The problem is this: what if we want to represent a non-ASCII character
in the device control escape sequence?

groff's device-independent output is, up to a point, strictly ISO Basic
Latin, a property we inherited from AT troff.

We have the same problem with the requests that write to the standard
error stream, like `tm`.  I'm not sure that problem is worth solving;
groff's own diagnostic messages are not i18n-ed.  Even if it is worth
solving, teaching device control commands how to interpret more kinds of
"node" seems like a higher priority.

We don't have any infrastructure for handling any character encoding but
the default for input.  That's ISO Latin-1 for most platforms, but IBM
code page 1047 for OS/390 Unix (I think--no one who runs groff on such a
machine has ever spoken with me of their experiences).  And in practice
GNU troff doesn't, as far as I can tell, ever write anything but the 94
graphical code points in ASCII, spaces, and newlines to its output.

I imagine a lot of people's first instinct to fix this is to say, "just
give groff full Unicode support and enable input and output of UTF-8"!

That's a huge ask.

A shorter pole might be to establish a protocol for communication of
Unicode code points within device control commands.  Portability isn't
much of an issue here: as far as I know there has been no effort to
achieve interoperation of device control escape sequences among troffs.

That convention even _could_ be UTF-8, but my initial instinct is _not_
to go that way.  I like the 7-bit cleanliness of GNU troff output, and
when I've mused about solving The Big Unicode Problem I have given
strong consideration to preserving it, or enabling tricked-out UTF-8
"grout" only via an option for the kids who really like to watch their
chrome rims spin.  I realize that Heirloom and neatroff can both boast
of this, but how many people _really_ look at device-independent troff
output?  A few curious people, and the poor saps who are stuck
developing and debugging the implementations, like me.  For the latter
community, a modest and well-behaved format saves a lot of time.
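Such a 7-bit-clean protocol might, for instance, reuse the `\[uXXXX]` notation inside device control arguments and let each driver decode it. The sketch below is a hypothetical driver-side decoder; the function name and the choice of notation are assumptions for illustration, not anything groff currently implements.

```python
import re

# Matches a \[uXXXX] special-character escape carrying a Unicode code
# point (4 to 6 hex digits), as it might appear inside the argument of
# an "x X" device control command in grout.
ESC = re.compile(r"\\\[u([0-9A-Fa-f]{4,6})\]")

def decode_device_control_arg(arg: str) -> str:
    """Replace each \\[uXXXX] escape with the character it names,
    leaving ordinary ASCII untouched.  The grout stream itself stays
    7-bit clean; only the driver materializes the non-ASCII text."""
    return ESC.sub(lambda m: chr(int(m.group(1), 16)), arg)

print(decode_device_control_arg(r"\[u041C]\[u0438]\[u0440]"))  # -> Мир
```

A driver like gropdf could run every `x X` argument through such a decoder before building bookmark strings, keeping the intermediate output readable and diffable as ASCII.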

Concretely, when I run the following command:

  GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 ./test-groff -Z -mom -Tpdf -pet \
  -Kutf8 ../contrib/mom/examples/mon_premier_doc.mom

I get the following diagnostics familiar to all who have build groff
1.22.4 from source.

troff:../contrib/mom/examples/mon_premier_doc.mom:30: error: can't translate character code 233 to special character ''e' in transparent throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:108: error: can't translate character code 233 to special character ''e' in transparent throughput
troff:../contrib/mom/examples/mon_premier_doc.mom:136: error: can't translate character code 232 to special character '`e' in transparent throughput

More tellingly, if I page the foregoing output with "less -R", I see
non-ASCII code points screaming out their rage in reverse video.

x X ps:exec [/Author (Cicéron,) /DOCINFO pdfmark

x X ps:exec [/Dest /pdf:bm4 /Title (1. Les différentes versions) /Level 2 /OUT pdfmark

x X ps:exec [/Dest /evolution /Title (2. Les évolutions du Lorem) /Level 2 /OUT pdfmark

x X ps:exec [/Dest /pdf:bm8 /Title (Table des matières) /Level 1 /OUT pdfmark

It therefore appears to me that the pdfmark extension to PostScript, or
PostScript itself, happily eats Latin-1...but that means that it eats
_only_ Latin-1, which forecloses the use of Cyrillic code points.
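For what it's worth, PDF text strings (which is what an outline /Title is) are not limited to Latin-1: the format also accepts UTF-16BE prefixed with a byte-order mark, and that is the standard way to express titles outside Latin-1. The sketch below renders such a string with PostScript octal escapes; it only illustrates the encoding, and is not what gropdf or grops actually emit today.

```python
def pdfmark_title(text: str) -> str:
    """Encode a title as a PDF text string: a U+FEFF byte-order mark
    followed by UTF-16BE, written as a parenthesized PostScript string
    with every byte octal-escaped so the result stays 7-bit clean."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    body = "".join("\\%03o" % b for b in data)
    return "(%s)" % body

# A Cyrillic title becomes (\376\377...) -- viewers such as okular and
# evince display this correctly in the outline.
print(pdfmark_title("Мир"))
```

So a fix need not wait for The Big Unicode Problem: once the driver knows the code points of a bookmark title, emitting them in this form is mechanical.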

I'm a little concerned that we're blindly _feeding_ the device control
commands characters with the eighth bit set.  It's obviously a useful
expedient for documents like mon_premier_doc.mom.  I am curious to know
why instead of getting no text for 

PDF outline not capturing Cyrillic text

2022-09-17 Thread Peter Schaffter
Greetings, all.

Source documents written in Cyrillic and processed with mom/pdfmom
break the PDF outline:

1. The text of titles and headings is not displayed in the outline.

2. If there are items where English is automatically part of the
   outline label (cover, title page), the label prints (minus the
   text), but the linking hierarchy fails with some viewers if there
   are intervening titles or headings, e.g.
 doc cover (Cyrillic)
 copyright notice (in English)
 table of contents (auto relocated)
 title page (Cyrillic)
 chapter title (Cyrillic)
 body text with Cyrillic headings
   With the above arrangement, the outline in okular and evince shows
   only
 Cover: (no text)
 Copyright:
 Title Page: (no text)
   In okular, clicking on Copyright takes you to the correct page,
   however in evince, clicking on Copyright takes you to the table
   of contents page.

At a guess, it looks as if gropdf or pdfmark isn't recognizing Cyrillic
characters as valid input for creating pdf bookmarks.  I'm at a
loss as to how to overcome this.  Ideas?

-- 
Peter Schaffter
https://www.schaffter.ca