Re: Baffling accented glyphs issue

2023-08-27 Thread G. Branden Robinson
Hi Peter,

At 2023-08-27T00:06:39-0400, Peter Schaffter wrote:
> On Sat, Aug 26, 2023, Bjarni Ingi Gislason wrote:
> > Lesson to learn: always use the information you have and give it to
> > the program, for example "groff" with option "-K".
> > 
> > groff -Kutf8 -V file
> > 
> > preconv -eutf8 file | troff -Tps | grops
> 
> Thanks, Bjarni.  Odd to be discovering this for the first time after
> decades of preparing documents.  Live and learn.

In groff 1.23.0, the preconv(1) man page attempts to address this
subject more directly.

   iconv support
   While preconv recognizes all of the coding tags listed above, it
   is capable on its own of interpreting only three encodings:
   Latin‐1, code page 1047, and UTF‐8.  If iconv support is
   configured at compile time and available at run time, all others
   are passed to iconv library functions, which may recognize many
   additional encoding strings.  The command “preconv -v” discloses
   whether iconv support is configured.

   The use of iconv means that characters in the input that encode
   invalid code points for that encoding may be dropped from the
   output stream or mapped to the Unicode replacement character
   (U+FFFD).  Compare the following examples using the input “café”
   (note the “e” with an acute accent), which due to its short
   length challenges inference of the encoding used.

       printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
       printf 'caf\351\n' | preconv -e us-ascii
       printf 'caf\351\n' | preconv -e latin-1

   The fate of the accented “e” differs in each case.  In the first,
   uchardet fails to detect an encoding (though the library on your
   system may behave differently) and preconv falls back to the
   locale settings, where octal 351 starts an incomplete UTF‐8
   sequence and results in the Unicode replacement character.  In
   the second, it is not a representable character in the declared
   input encoding of US‐ASCII and is discarded by iconv.  In the
   last, it is correctly detected and mapped.

Regards,
Branden
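
The remark in the excerpt that octal 351 "starts an incomplete UTF-8
sequence" follows from the byte ranges UTF-8 assigns to lead and
continuation bytes.  A minimal sketch in plain POSIX shell, needing no
groff tools; the function name utf8_role is made up for illustration:

```shell
#!/bin/sh
# utf8_role: report what role a single byte value can play in UTF-8.
# The argument may be decimal or 0-prefixed octal (e.g. 0351).
utf8_role() {
    byte=$(( $1 ))
    if [ "$byte" -lt 128 ]; then
        echo "ASCII"
    elif [ "$byte" -lt 192 ]; then
        echo "continuation byte"
    elif [ "$byte" -lt 194 ]; then
        echo "invalid"                       # 0xC0 and 0xC1 are never used
    elif [ "$byte" -lt 224 ]; then
        echo "lead byte of 2-byte sequence"
    elif [ "$byte" -lt 240 ]; then
        echo "lead byte of 3-byte sequence"
    elif [ "$byte" -lt 245 ]; then
        echo "lead byte of 4-byte sequence"
    else
        echo "invalid"
    fi
}

utf8_role 0351   # the Latin-1 "é": lead byte of 3-byte sequence
utf8_role 012    # the newline after it: ASCII, so the sequence is cut short
```

A lead byte must be followed by continuation bytes; here it is followed
by a newline, which is why a UTF-8 reading of the input cannot succeed.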




Re: Baffling accented glyphs issue

2023-08-26 Thread Peter Schaffter
On Sat, Aug 26, 2023, Bjarni Ingi Gislason wrote:
>   Lesson to learn: always use the information you have and give it to
> the program, for example "groff" with option "-K".
> 
> groff -Kutf8 -V file
> 
> preconv -eutf8 file | troff -Tps | grops

Thanks, Bjarni.  Odd to be discovering this for the first time after
decades of preparing documents.  Live and learn.

-- 
Peter Schaffter
https://www.schaffter.ca



Re: Baffling accented glyphs issue

2023-08-26 Thread Bjarni Ingi Gislason
preconv -d  shows

fallback encoding: 'ISO-8859-1'
processing 'txtFS_cF9dOUO.txt'
  no coding tag
  len: 95
  uchardet read: 95 bytes
  charset: ISO-8859-13
  encoding used: 'ISO-8859-13'
.lf 1 txtFS_cF9dOUO.txt
.sp |1i-1v
Ce qui me pla\[u0106]\[u00AE]t le plus, c'est quand je suis assis
confortablement avec mes
chats.
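
The two escapes in that output are consistent with the UTF-8 bytes of
"î" (octal 303 256) being read as ISO-8859-13, one character per byte.
A sketch that reproduces the mapping with iconv(1) alone, assuming the
installed iconv knows the names ISO-8859-13 and UTF-16BE; the helper
name codepoints_as is hypothetical:

```shell
#!/bin/sh
# codepoints_as: decode standard input using the encoding named in $1
# and print each resulting character as a UTF-16BE byte pair in hex.
codepoints_as() {
    # Unquoted on purpose: word splitting collapses od's spacing.
    set -- $(iconv -f "$1" -t UTF-16BE | od -An -tx1)
    echo "$*"
}

# "î" in UTF-8 is the two bytes 303 256 (hex c3 ae).
printf '\303\256' | codepoints_as UTF-8         # one character:  00 ee
printf '\303\256' | codepoints_as ISO-8859-13   # two characters: 01 06 00 ae
```

The second result corresponds to \[u0106]\[u00AE] in the preconv
output.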



Re: Baffling accented glyphs issue

2023-08-26 Thread Steve Izma
On Sat, Aug 26, 2023 at 04:47:02PM -0400, Peter Schaffter wrote:
> Subject: Baffling accented glyphs issue
> 
> chats_1 contains a single accented glyph (î).  The glyph is mangled
> in the ps output. chats_2 contains the same glyph plus an additional
> one (é).  Here, neither glyph is mangled in the output.  The same
> oddity occurs with the pdf driver and with -Tutf8.
> 
> Even more peculiar is that the introduction of *any* accented glyph
> into the source file (in addition to the originally mangled glyph),
> even one commented out at the end of the file, fixes the problem
> with the initial mangled glyph.  Try adding
> 
>   .\"ô
> 
> or similar to the end of chats_1 and processing it to see what I mean.
> 
> I'm not sure if this is new behaviour because I can't
> recall ever creating a document with only one accented glyph.

Hi Peter,

I've noticed this behaviour only lately myself. I think Bjarni's
explanation accounts for it.

But your email encoded the texts as iso-8859-1, not utf8, so when I
saved the file and ran it there wasn't a problem.

When I created my own file with the single accented character as
utf-8 I got the same problem as you indicated.

Using -K utf8 solves it. I guess I rarely have files with only a
single example of a utf-8 character. I use the utf-8 open and
closing double quotes very frequently so that probably makes the
difference to preconv.

-- Steve

-- 
Steve Izma



Re: Baffling accented glyphs issue

2023-08-26 Thread Deri
On Saturday, 26 August 2023 21:47:02 BST Peter Schaffter wrote:
> Here's something I haven't encountered before.  Have a look at the
> attached files.  ps output was produced with
> 
>   groff -k file > file.ps
> 
> chats_1 contains a single accented glyph (î).  The glyph is mangled
> in the ps output. chats_2 contains the same glyph plus an additional
> one (é).  Here, neither glyph is mangled in the output.  The same
> oddity occurs with the pdf driver and with -Tutf8.
> 
> Even more peculiar is that the introduction of *any* accented glyph
> into the source file (in addition to the originally mangled glyph),
> even one commented out at the end of the file, fixes the problem
> with the initial mangled glyph.  Try adding
> 
>   .\"ô
> 
> or similar to the end of chats_1 and processing it to see what I mean.
> 
> I'm not sure if this is new behaviour because I can't
> recall ever creating a document with only one accented glyph.
> 
> Ideas, anyone?

Hi Peter,

I can't duplicate this. When you give groff just -k, preconv has to guess the
encoding, so my guess is that our preconvs are different. It can use iconv and
uchardet for this guessing. What does preconv -v tell you? Mine says:

[derij@pip Chats (deri-gropdf-ng)]$ preconv -v
GNU preconv (groff) version 1.23.0.16-a53f5-dirty with iconv support and with uchardet support

It is the support information that is important; you may be missing one of
them from your build. The preconv man page explains how it guesses, or you
can tell it explicitly which encoding to use.

Cheers 

Deri
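
The check described in this message can be scripted.  A rough sketch
that looks for the "with iconv support" and "with uchardet support"
wording quoted above; it assumes nothing about how a build lacking a
library words its message, and the helper name has_support is made up:

```shell
#!/bin/sh
# has_support: succeed if the version text in $2 reports support for
# the library named in $1 ("iconv" or "uchardet").
has_support() {
    # Collapse line breaks first, in case the text was wrapped.
    text=$(printf '%s' "$2" | tr '\n' ' ' | tr -s ' ')
    case $text in
        *"with $1 support"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query preconv only if it is installed.
if command -v preconv >/dev/null 2>&1; then
    version_text=$(preconv -v 2>&1)
    for lib in iconv uchardet; do
        if has_support "$lib" "$version_text"; then
            echo "$lib: supported"
        else
            echo "$lib: not reported"
        fi
    done
fi
```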






Re: Baffling accented glyphs issue

2023-08-26 Thread Bjarni Ingi Gislason
 1) Look at the output of

preconv 

  With the option '-k', "groff" has to guess the encoding, and it can't
do so reliably when the file contains just one non-ASCII character.

groff -k -V file

preconv file | troff -Tps | grops

2) Add the same character to the file.

preconv 

  Lesson to learn: always use the information you have and give it to
the program, for example "groff" with option "-K".

groff -Kutf8 -V file

preconv -eutf8 file | troff -Tps | grops

  This reduces the resources that are needed to find out
which encoding the software has to use.
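
When the encoding is not known in advance, UTF-8 can at least be
checked strictly instead of being guessed.  A minimal sketch, assuming
iconv(1) is installed and (an assumption not made anywhere in this
thread) that anything that is not well-formed UTF-8 is Latin-1; the
helper name guess_K_arg is hypothetical:

```shell
#!/bin/sh
# guess_K_arg: print an encoding name suitable for groff's -K option.
# A pure ASCII file is also well-formed UTF-8 and is reported as utf8.
guess_K_arg() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo utf8
    else
        echo latin-1    # assumed fallback, not detected
    fi
}

# Example use with the commands shown above:
#   groff -K"$(guess_K_arg file)" file > file.ps
```

Because a single stray byte makes the UTF-8 check fail, this does not
depend on how many accented characters the file contains.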