Re: Accent mystery

2024-02-21 Thread G. Branden Robinson
Hi Peter,

At 2024-02-21T13:42:58-0500, Peter Schaffter wrote:
> > It looks idiomatic enough to me.  You can expect this change in my
> > next push.  Thanks!
> 
> Deri sent me a patch this morning.  I've applied and tested it.
> Fixes the issue.  If you want, I can push the change.

Sure!

Here's what I had queued up; I offer it mainly for the changelog.

commit 69f4eba10487886681dba206b8e7071394cd3445
Author: Tadziu Hoffmann 
Date:   Wed Feb 21 01:27:43 2024 +0100

[pdfmom]: Improve `-k`, `-K` option handling.

* contrib/mom/pdfmom.pl: Accrete `-k` and `-K` options to groff instead
  of letting one clobber the other.

See message at
 and
follow-ups.

diff --git a/contrib/mom/ChangeLog b/contrib/mom/ChangeLog
index 79b8cb68b..8389cd65a 100644
--- a/contrib/mom/ChangeLog
+++ b/contrib/mom/ChangeLog
@@ -1,3 +1,12 @@
+2024-02-21  Tadziu Hoffmann 
+
+   * pdfmom.pl: Accrete `-k` and `-K` options to groff instead of
+   letting one clobber the other.
+
+   See message at
+   
+   and follow-ups.
+
 2024-01-31  G. Branden Robinson 
 
    * examples/test-mom.sh.in: Fix code style/diagnostic nits.
diff --git a/src/devices/gropdf/pdfmom.pl b/src/devices/gropdf/pdfmom.pl
index 1c24d85b9..25c646cfc 100644
--- a/src/devices/gropdf/pdfmom.pl
+++ b/src/devices/gropdf/pdfmom.pl
@@ -68,18 +68,18 @@ while (my $c=shift)
     {
         if (length($c) > 2)
         {
-            $preconv=$c;
+            $preconv.=' '.$c;
         }
         else
         {
-            $preconv=$c;
+            $preconv.=' '.$c;
             $preconv.=shift;
         }
         next;
     }
     elsif (substr($c,0,2) eq '-k')
     {
-        $preconv=$c;
+        $preconv.=' '.$c;
         next;
     }
     elsif ($c eq '-Z')

Regards,
Branden




Re: Accent mystery

2024-02-21 Thread Peter Schaffter
Branden --

On Wed, Feb 21, 2024, G. Branden Robinson wrote:
> At 2024-02-21T01:27:43+0100, Tadziu Hoffmann wrote:
> > I guess the simplest fix would be to have pdfmom _append_ the
> > "-k" and "-K" options to $preconv, i.e., replacing
> > 
> >   $preconv=$c;
> > 
> > by
> > 
> >   $preconv.=' '.$c;
> > 
> > at all three occurrences (the space makes sure the option strings
> > don't run together).  But I have very little experience with
> > perl, perhaps a cleaner solution exists.
> 
> It looks idiomatic enough to me.  You can expect this change in my next
> push.  Thanks!

Deri sent me a patch this morning.  I've applied and tested it.
Fixes the issue.  If you want, I can push the change.

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Accent mystery

2024-02-21 Thread G. Branden Robinson
Hi Tadziu!

At 2024-02-21T01:27:43+0100, Tadziu Hoffmann wrote:
> I guess the simplest fix would be to have pdfmom _append_ the
> "-k" and "-K" options to $preconv, i.e., replacing
> 
>   $preconv=$c;
> 
> by
> 
>   $preconv.=' '.$c;
> 
> at all three occurrences (the space makes sure the option strings
> don't run together).  But I have very little experience with
> perl, perhaps a cleaner solution exists.

It looks idiomatic enough to me.  You can expect this change in my next
push.  Thanks!

Regards,
Branden




Re: Accent mystery

2024-02-20 Thread Tadziu Hoffmann


> However, pdfmom is supposed to accept all the same
> options as groff.  Here, it does not, since "-Kutf8 -k" is
> acceptable to groff.
> 
>   groff -Tpdf -Kutf8 -k -mom timeline.mom > timeline.pdf
> 
> works but
> 
>   pdfmom -Kutf8 -k timeline.mom > timeline.pdf
> 
> fails.

In the perl script, both "-k" and "-K" options _assign_ the
$preconv variable (which ultimately gets passed to groff),
so the second overwrites the first, leading to the encoding
info "utf8" getting lost.

However, "-Kutf8 -k" feels unnatural to me, whereas "-k -Kutf8"
feels reasonable.  If pdfmom was only tested with the latter
order of options, it would not have been noticed that "-k" got
lost, because it would be implied by groff seeing "-K".

I guess the simplest fix would be to have pdfmom _append_ the
"-k" and "-K" options to $preconv, i.e., replacing

  $preconv=$c;

by

  $preconv.=' '.$c;

at all three occurrences (the space makes sure the option strings
don't run together).  But I have very little experience with
perl, perhaps a cleaner solution exists.
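
For illustration only, here is a minimal shell sketch of the difference
between assigning and appending.  It is not the pdfmom code (which is
Perl); the function names clobber and accrete are hypothetical, and the
sketch ignores the separate "-K utf8" spelling in which the encoding
arrives as its own argument.

```shell
#!/bin/sh
# Hypothetical stand-ins for pdfmom's option loop; not the real Perl code.

# Old behaviour: each -k/-K option overwrites whatever was stored before.
clobber() {
    preconv=''
    for opt in "$@"; do
        case $opt in
            -k*|-K*) preconv=$opt ;;
        esac
    done
    printf '%s\n' "$preconv"
}

# Proposed behaviour: each -k/-K option is appended, separated by a space
# so that the option strings don't run together.
accrete() {
    preconv=''
    for opt in "$@"; do
        case $opt in
            -k*|-K*) preconv="$preconv $opt" ;;
        esac
    done
    printf '%s\n' "$preconv"
}

clobber -Kutf8 -k   # prints "-k": the encoding is lost
accrete -Kutf8 -k   # prints " -Kutf8 -k": both options survive
```

With the options the other way round, "clobber -k -Kutf8" prints
"-Kutf8", which still works because "-K" implies "-k"; that is why
the problem only shows up with one ordering.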





Re: Accent mystery

2024-02-20 Thread Peter Schaffter
On Tue, Feb 20, 2024, Tadziu Hoffmann wrote:
> 
> > Processed with
> >   pdfmom -Kutf8 -k  timeline.mom > timeline.pdf
> > the é is garbage.
> 
> If I swap the order of the options:
> 
>   pdfmom -k -Kutf8 timeline.mom >timeline.pdf
> 
> or leave out the "-k" entirely (since it is implied by "-K"):
> 
>   pdfmom -Kutf8 timeline.mom >timeline.pdf
> 
> it works on my machine.

Same here.  However, pdfmom is supposed to accept all the same
options as groff.  Here, it does not, since "-Kutf8 -k" is
acceptable to groff.

  groff -Tpdf -Kutf8 -k -mom timeline.mom > timeline.pdf

works but

  pdfmom -Kutf8 -k timeline.mom > timeline.pdf

fails.

Deri's not with us these days and I'm loath to muck about in
somebody else's code.  Perhaps someone less timid could have a look.
Alternatively, I suppose the documentation for -K could simply be
amended to read "don't use -k with -K" (though more elegantly
worded) instead of "implies -k".

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Accent mystery

2024-02-20 Thread Tadziu Hoffmann


> Processed with
>   pdfmom -Kutf8 -k  timeline.mom > timeline.pdf
> the é is garbage.

If I swap the order of the options:

  pdfmom -k -Kutf8 timeline.mom >timeline.pdf

or leave out the "-k" entirely (since it is implied by "-K"):

  pdfmom -Kutf8 timeline.mom >timeline.pdf

it works on my machine.





Re: Accent mystery

2024-02-20 Thread Peter Schaffter
Hi, Branden.

On Mon, Feb 19, 2024, G. Branden Robinson wrote:
> At 2024-02-19T12:39:53-0500, Peter Schaffter wrote:
> > Your minimal file renders fine on my system without -Kutf8 *and* I
> > recently encountered a file with a single accented character where
> > passing -Kutf8 had no effect (I had to introduce the character
> > "silently" in an unused diversion to fix the problem).
> 
> I'd be curious to see the output you get from "preconv -d >/dev/null" on
> that file.

On further inspection, the problem seems to be with pdfmom.  Here's
a test file:

.\" Filename: timeline.mom
.
.TITLE "Repair Timeline"
.SUBTITLE "6-331 Mona Ave" "Vanier ON K1L 7A3" "File no: LTB-T-028792-22"
.PRINTSTYLE TYPESET
.START
.PP
In this document, Dimitrius (also called "Dim") refers to Dimitrius
Stavrou, proxy landlord and property manager of 331 Mona Ave. for
the period of my tenancy up to January 2024.  Mr. Stavrou resides in
Montréal.

Processed with
  groff -mom -Tpdf -Kutf8 -k timeline.mom > timeline.pdf
the é in Montréal renders correctly.

Processed with
  pdfmom -Kutf8 -k  timeline.mom > timeline.pdf
the é is garbage.

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Accent mystery

2024-02-19 Thread G. Branden Robinson
Hi Peter,

At 2024-02-19T12:39:53-0500, Peter Schaffter wrote:
> Your minimal file renders fine on my system without -Kutf8 *and* I
> recently encountered a file with a single accented character where
> passing -Kutf8 had no effect (I had to introduce the character
> "silently" in an unused diversion to fix the problem).

I'd be curious to see the output you get from "preconv -d >/dev/null" on
that file.

Regards,
Branden




Re: Accent mystery

2024-02-19 Thread Peter Schaffter
Robert --

On Mon, Feb 19, 2024, Robert Goulding via wrote:
> I have been trying to figure this out all morning! I have a handout with
> the word "kataskeuê" in it. Every time I try to compile it (groff -Tpdf -k
> -ms) I get the warning: warning: special character 'u0053_0326' not defined
> (Same if I go the ps2pdf route)
> 
> 
> Try and compile this minimal file
> 
> .LP
> kataskeuê
> 
> Do you get a warning, and a weird character in the pdf?

I didn't get the warning and the pdf rendered properly.  However, I
had something similar crop up recently and the solution turned out
to be related to the fact that my file contained only one accented
character.  In such cases, explicitly stating the character encoding
with the -K flag is supposed to fix the issue.  But...

Your minimal file renders fine on my system without -Kutf8 *and* I
recently encountered a file with a single accented character where
passing -Kutf8 had no effect (I had to introduce the character
"silently" in an unused diversion to fix the problem).

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Accent mystery

2024-02-19 Thread Robert Goulding via
Ah, thank you so much (I needed to RTFM!) - R.

On Mon, Feb 19, 2024 at 12:44 PM G. Branden Robinson <
g.branden.robin...@gmail.com> wrote:

> Hi Robert,
>
> At 2024-02-19T12:40:16-0500, Robert Goulding via wrote:
> > To answer my own question: It seems that preconv is not guessing the
> > correct encoding from the file with a single word in it.  If I specify
> > -K utf-8 everything works OK.
> >
> > preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> > support and with uchardet support
> >
> > Is this an expected shortcoming of preconv - that if a file contains
> > just a single accented character, it won't guess it correctly? The
> > original file it failed on was a 2-page pdf, which has the word
> > kataskeuê in the middle of it.
>
> Yes.  The man page says:
>
>Coding tags
>  Text editors that support more than a single character encoding
>  need tags within the input files to mark the file’s encoding.
>  While it is possible to guess the right input encoding with the
>  help of heuristics that produce good results for a preponderance of
>  natural language texts, they are not absolutely reliable.
>  Heuristics can fail on inputs that are too short or don’t represent
>  a natural language.
> [...]
>  The use of iconv means that characters in the input that encode
>  invalid code points for that encoding may be dropped from the
>  output stream or mapped to the Unicode replacement character
>  (U+FFFD).  Compare the following examples using the input “café”
>  (note the “e” with an acute accent), which due to its short length
>  challenges inference of the encoding used.
> printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
> printf 'caf\351\n' | preconv -e us-ascii
> printf 'caf\351\n' | preconv -e latin-1
>  The fate of the accented “e” differs in each case.  In the first,
>  uchardet fails to detect an encoding (though the library on your
>  system may behave differently) and preconv falls back to the locale
>  settings, where octal 351 starts an incomplete UTF‐8 sequence and
>  results in the Unicode replacement character.  In the second, it is
>  not a representable character in the declared input encoding of US‐
>  ASCII and is discarded by iconv.  In the last, it is correctly
>  detected and mapped.
>
> Regards,
> Branden
>


-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.


Re: Accent mystery

2024-02-19 Thread G. Branden Robinson
Hi Robert,

At 2024-02-19T12:40:16-0500, Robert Goulding via wrote:
> To answer my own question: It seems that preconv is not guessing the
> correct encoding from the file with a single word in it.  If I specify
> -K utf-8 everything works OK.
> 
> preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> support and with uchardet support
> 
> Is this an expected shortcoming of preconv - that if a file contains
> just a single accented character, it won't guess it correctly? The
> original file it failed on was a 2-page pdf, which has the word
> kataskeuê in the middle of it.

Yes.  The man page says:

   Coding tags
 Text editors that support more than a single character encoding
 need tags within the input files to mark the file’s encoding.
 While it is possible to guess the right input encoding with the
 help of heuristics that produce good results for a preponderance of
 natural language texts, they are not absolutely reliable.
 Heuristics can fail on inputs that are too short or don’t represent
 a natural language.
[...]
 The use of iconv means that characters in the input that encode
 invalid code points for that encoding may be dropped from the
 output stream or mapped to the Unicode replacement character
 (U+FFFD).  Compare the following examples using the input “café”
 (note the “e” with an acute accent), which due to its short length
 challenges inference of the encoding used.
        printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
        printf 'caf\351\n' | preconv -e us-ascii
        printf 'caf\351\n' | preconv -e latin-1
 The fate of the accented “e” differs in each case.  In the first,
 uchardet fails to detect an encoding (though the library on your
 system may behave differently) and preconv falls back to the locale
 settings, where octal 351 starts an incomplete UTF‐8 sequence and
 results in the Unicode replacement character.  In the second, it is
 not a representable character in the declared input encoding of US‐
 ASCII and is discarded by iconv.  In the last, it is correctly
 detected and mapped.
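
As a rough illustration of why octal 351 is ambiguous, the sketch below
(which needs only printf, wc, and od, not groff) compares the bytes of
"café" in Latin-1 and in UTF-8.  The variable names are made up for the
example.  In Latin-1 the accented "e" is the single byte 351; in UTF-8
it is the two bytes 303 251, and a lone 351 byte is the lead byte of a
three-byte sequence, hence "incomplete" when read as UTF-8.

```shell
#!/bin/sh
# Count the bytes of "café" in two encodings.  Octal escapes keep this
# script pure ASCII, as in the man page's own examples.
latin1_len=$(( $(printf 'caf\351' | wc -c) ))     # é as one byte: 351
utf8_len=$(( $(printf 'caf\303\251' | wc -c) ))   # é as two bytes: 303 251
echo "Latin-1: $latin1_len bytes"
echo "UTF-8:   $utf8_len bytes"

# Dump the raw bytes in octal for inspection.
printf 'caf\351' | od -An -to1
printf 'caf\303\251' | od -An -to1
```

The exact spacing of the od output varies between implementations, so
only the byte counts are worth comparing across systems.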

Regards,
Branden




Re: Accent mystery

2024-02-19 Thread Robert Goulding via
To answer my own question: It seems that preconv is not guessing the
correct encoding from the file with a single word in it.  If I specify -K
utf-8 everything works OK.

preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv support
and with uchardet support

Is this an expected shortcoming of preconv - that if a file contains just a
single accented character, it won't guess it correctly? The original file
it failed on was a 2-page pdf, which has the word kataskeuê in the middle
of it.

Robert.

On Mon, Feb 19, 2024 at 10:52 AM Robert Goulding 
wrote:

> I have been trying to figure this out all morning! I have a handout with
> the word "kataskeuê" in it. Every time I try to compile it (groff -Tpdf -k
> -ms) I get the warning: warning: special character 'u0053_0326' not defined
> (Same if I go the ps2pdf route)
>
>
> Try and compile this minimal file
>
> .LP
> kataskeuê
>
> Do you get a warning, and a weird character in the pdf?
>
> But *this* minimal file compiles just fine:
>
> .LP
> kataskeuê êéè
>
> No warnings, all the characters come out correct. What could be the
> reason? (Using groff 1.23.0)
>
> R.
>
> --
> Robert Goulding
> Director, John J. Reilly Center for Science, Technology, and Values;
> Assoc. Professor, Program of Liberal Studies,
> Fellow, Medieval Institute,
> University of Notre Dame.
>


-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.


Accent mystery

2024-02-19 Thread Robert Goulding via
I have been trying to figure this out all morning! I have a handout with
the word "kataskeuê" in it. Every time I try to compile it (groff -Tpdf -k
-ms) I get the warning: warning: special character 'u0053_0326' not defined
(Same if I go the ps2pdf route)


Try and compile this minimal file

.LP
kataskeuê

Do you get a warning, and a weird character in the pdf?

But *this* minimal file compiles just fine:

.LP
kataskeuê êéè

No warnings, all the characters come out correct. What could be the reason?
(Using groff 1.23.0)

R.

-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.