Package: ghostscript
Version: 9.56.1~dfsg-1
Severity: normal
Tags: upstream
Forwarded: https://bugs.ghostscript.com/show_bug.cgi?id=705246

When an input PDF file has a character like U+2308 LEFT CEILING and
has a ToUnicode CMap, the new PDF interpreter may yield an incorrect
ToUnicode CMap in the generated PDF. The issue seems to be limited
to characters like math symbols (in the same font as the problematic
character?), though; letters, including accented ones, do not seem
to be affected.

Here's a shell script used for some testing:

────────────────────────────────────────────────────────────────────────
#!/bin/sh

set -e

out()
{
  printf '%s' "$i$j ($1):"
  printf " %s" $(pdftotext chartest9$i$j$2.pdf - | tr -d '\f')
  echo
}

for i in a b
do
  for j in 0 1
  do
    cat <<'EOF' | sed "s/:$i/\\\\lceil/" | \
                  sed "s/:a//" | \
                  sed "s/J/$j/" > chartest9.tex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\pdfgentounicode=J
\begin{document}
\thispagestyle{empty}
$\in:a$
\end{document}
EOF

    pdflatex chartest9.tex > /dev/null
    mv chartest9.pdf chartest9$i$j.pdf
    out "pdfTeX" ""

    ps2pdf14 chartest9$i$j.pdf chartest9$i$j-new.pdf
    out "gs new" "-new"

    ps2pdf14 -dNEWPDF=false chartest9$i$j.pdf chartest9$i$j-old.pdf
    out "gs old" "-old"
  done
done
────────────────────────────────────────────────────────────────────────

See the upstream bug for the obtained PDF files.

Four kinds of PDF inputs are tested (a0, a1, b0, b1), where
  * a: the content corresponds to "∈⌈" (ELEMENT OF + LEFT CEILING)
  * b: the content corresponds to "∈" (ELEMENT OF)
  * 0: \pdfgentounicode=0 (pdfTeX does not generate a ToUnicode CMap)
  * 1: \pdfgentounicode=1 (pdfTeX generates a ToUnicode CMap)

I've compared (see the script above for details):
  * pdfTeX: PDF file generated by pdfTeX from TeX Live 2022
  * gs new: PDF file obtained with the new PDF interpreter (default)
  * gs old: PDF file obtained with the old PDF interpreter (-dNEWPDF=false)

I've done the tests with the ghostscript 9.56.1~dfsg-1 Debian package.

If LEFT CEILING is not present, Ghostscript does not generate
a ToUnicode CMap in any of these cases, which is fine. But if
this character is present:

1. With the old PDF interpreter, Ghostscript generates a correct
ToUnicode CMap.

2. With the new PDF interpreter and no input ToUnicode CMap,
Ghostscript does not generate a ToUnicode CMap (the only practical
issue is that one cannot get unusual characters like LEFT CEILING, but
this is not worse than what TeX Live 2022 can yield in any case).

3. With the new PDF interpreter and an input ToUnicode CMap like
the one from TeX Live 2022, Ghostscript generates an incorrect
ToUnicode CMap, which prevents one from getting usual math
characters such as ELEMENT OF.

The results, where I've added ToUnicode CMap information (which I have
obtained with "qpdf --stream-data=uncompress" on these PDF files):

a0 (pdfTeX): ∈d (no CMap)
a0 (gs new): ∈d (no CMap)
a0 (gs old): ∈⌈ (CMap old)
a1 (pdfTeX): ∈d (CMap 1)
a1 (gs new):    (CMap 1-new)
a1 (gs old): ∈⌈ (CMap old)
b0 (pdfTeX): ∈  (no CMap)
b0 (gs new): ∈  (no CMap)
b0 (gs old): ∈  (no CMap)
b1 (pdfTeX): ∈  (CMap 1)
b1 (gs new): ∈  (no CMap)
b1 (gs old): ∈  (no CMap)

with the following ToUnicode CMaps:

CMap old:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><2208>
<64><64><2308>
endbfrange
endcmap
────────────────────────────────────────

CMap 1:
────────────────────────────────────────
begincmap
/CIDSystemInfo
<< /Registry (TeX)
/Ordering (lmsy10-lm-mathsy)
/Supplement 0
>> def
/CMapName /TeX-lmsy10-lm-mathsy-0 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
0 beginbfrange
endbfrange
0 beginbfchar
endbfchar
endcmap
────────────────────────────────────────

CMap 1-new:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><00>
<64><64><00>
endbfrange
endcmap
────────────────────────────────────────
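
The bad entries in CMap 1-new (every bfrange destination is <00>) can
be spotted mechanically. Below is a minimal sketch, not part of the
test script above: the function names extract_cmaps and
find_nul_bfrange are hypothetical, and it assumes that the PDF has
already been rewritten with "qpdf --stream-data=uncompress" and that
each bfrange entry sits on its own line in the "<lo><hi><dst>" form
shown above (array destinations are not handled).

```shell
# Print every "begincmap ... endcmap" section of an uncompressed PDF.
# LC_ALL=C makes sed treat the binary parts of the file as plain bytes.
extract_cmaps()
{
  LC_ALL=C sed -n '/begincmap/,/endcmap/p' "$1"
}

# Report bfrange entries whose destination consists only of zeros.
# Reads the files given as arguments, or standard input if there are none.
# Exit status: 1 if at least one such entry was found, 0 otherwise.
find_nul_bfrange()
{
  awk -F'[<>]' '
    BEGIN { bad = 0 }
    /beginbfrange/ { inrange = 1; next }
    /endbfrange/   { inrange = 0; next }
    # With "<" and ">" as separators, $2 and $4 are the first and last
    # source codes and $6 is the destination.
    inrange && NF >= 6 && $6 ~ /^0+$/ {
      printf "code <%s>..<%s> maps to NUL (<%s>)\n", $2, $4, $6
      bad = 1
    }
    END { exit bad }
  ' "$@"
}

# Possible usage (file names follow the test script above; the "-unc"
# suffix is an assumption):
#   qpdf --stream-data=uncompress chartest9a1-new.pdf chartest9a1-new-unc.pdf
#   extract_cmaps chartest9a1-new-unc.pdf | find_nul_bfrange
```

On a CMap like "CMap 1-new", this should report the entries for codes
<32> and <64>; on "CMap old", it should print nothing and return 0.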

-- System Information:
Debian Release: bookworm/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'stable-updates'), (500, 
'stable-security'), (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 
'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.17.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, 
TAINT_UNSIGNED_MODULE
Locale: LANG=POSIX, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages ghostscript depends on:
ii  libc6   2.33-7
ii  libgs9  9.56.1~dfsg-1

ghostscript recommends no packages.

Versions of packages ghostscript suggests:
ii  ghostscript-x  9.56.1~dfsg-1

-- no debconf information

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
