[Libreoffice-bugs] [Bug 156079] Text in exported PDF fails to be c&p'ed

bugzilla-daemon Sat, 01 Jul 2023 23:50:18 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=156079


--- Comment #5 from ⁨خالد حسني⁩ <kha...@libreoffice.org> ---
Copying from Adobe Reader, I get:

012345 6789

(no funny characters, but there is an extra space, which is not surprising as
many PDF readers will interpret a large gap between glyphs as a space even if
the PDF does not have a space character there)

If I use pdftotext, I get:

0123456789

The number grouping is a “feature” of Linux Libertine G font, but it is done in
a very odd way that affects PDF export.

$ hb-shape LinBiolinum_R_G.ttf "0123456789" --no-positions
[zero=0|uni202F=1|one=1|two=2|three=3|uni202F=4|four=4|five=4|six=6|uni202F=7|seven=7|eight=7|nine=7]

(the text before the equal sign is the glyph name, and the number after it is
the index of the input string character that this glyph corresponds to)

The font outputs zero fine, no funny business. Then it outputs the glyph for
NNBSP (U+202F, NARROW NO-BREAK SPACE), then the glyph for one, and gives both
the same input string index, then two and three normally, then NNBSP, four and
five, and gives all three of them the same input string index, then six
normally, then NNBSP, seven, eight and nine, and gives all four of them the
same input string index.

This funny business with input string index leads us to group the output as the
following mapping between glyphs and input characters:

zero => "0"
uni202F,one => "1"
two => "2"
three => "3"
uni202F,four,five => "45"
six => "6"
uni202F,seven,eight,nine => "789"
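
The grouping above can be reproduced with a few lines of Python. This is only a
minimal sketch, not the code LibreOffice uses: the function name
group_by_cluster is made up for illustration, and it assumes left-to-right
text, non-decreasing indices, and that the indices count characters of the
input string.

```python
from itertools import groupby

# hb-shape output copied from the comment above.
SHAPED = ("[zero=0|uni202F=1|one=1|two=2|three=3|uni202F=4|four=4|five=4|"
          "six=6|uni202F=7|seven=7|eight=7|nine=7]")


def group_by_cluster(shaped, text):
    """Group glyph names by input index and pair each group with its text.

    Hypothetical helper for illustration only. A group covers the input
    characters from its own index up to the index of the next group.
    """
    pairs = []
    for item in shaped.strip("[]").split("|"):
        name, index = item.rsplit("=", 1)
        pairs.append((name, int(index)))

    # Consecutive glyphs that share an index form one group.
    groups = [(index, [name for name, _ in run])
              for index, run in groupby(pairs, key=lambda p: p[1])]

    result = []
    for i, (start, names) in enumerate(groups):
        end = groups[i + 1][0] if i + 1 < len(groups) else len(text)
        result.append((names, text[start:end]))
    return result


MAPPING = group_by_cluster(SHAPED, "0123456789")
for names, chars in MAPPING:
    print(",".join(names), "=>", repr(chars))
```

Every group that has more than one glyph on the left and more than one
character on the right is one of the problematic cases discussed next.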

This mapping of multiple glyphs to multiple input characters is problematic in
PDF for text extraction, since PDF can represent only a one glyph to one
character mapping or a one glyph to multiple characters mapping. To keep the
text copy-able we have to resort to tagging the problematic glyph groups using
/ActualText spans, and not all PDF viewers support this.
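
For readers unfamiliar with /ActualText, here is a rough sketch of what such a
span looks like in a PDF content stream, built with Python. It is an
illustration under assumptions, not LibreOffice's actual output: the function
name actual_text_span is hypothetical, and the glyph-showing operator passed in
is a placeholder with made-up glyph IDs. The replacement text is written as a
hex string in UTF-16BE with a leading byte order mark, which is one of the
encodings PDF accepts for text strings.

```python
def actual_text_span(text, glyph_ops):
    """Wrap glyph-showing operators in a marked-content span.

    Hypothetical helper: 'text' is what a viewer should extract for the
    whole span, 'glyph_ops' is the content stream fragment that draws
    the glyphs (treated here as an opaque string).
    """
    # UTF-16BE with BOM, written as a PDF hex string.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# Placeholder glyph IDs standing in for uni202F, four, five.
SPAN = actual_text_span("45", "<001200130014> Tj")
print(SPAN)
```

A viewer that honours /ActualText should return "45" for the whole span
regardless of which glyphs are drawn inside it; a viewer that ignores it falls
back to per-glyph mapping, which is where the extraction goes wrong.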

So this is a combination of an oddly built font and buggy PDF viewers; we are
doing our best and there is not much more we can do about this.

