[poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Daniel Flipo Sat, 02 Sep 2017 01:25:25 -0700

Hi all,

Even when pdftotext is run with the option "-enc UTF-8", it converts all
non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
whether this behaviour is intended or not.


In French, high punctuation characters (:;!?) should be preceded by
non-breaking spaces; the Unicode characters U+a0 for `:' and U+202f (thin)
for the three others are perfect for this purpose.

French quote characters `«' and `»' also need a U+a0 non-breaking space.
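
To make the spacing rule above concrete, here is a minimal Python sketch
that applies it to a plain string. The function name french_spacing is
only illustrative, and it assumes the input has one ordinary U+20 space
at each of these positions, with the quote space placed inside the
quotes (after `«' and before `»').

```python
import re

NBSP = "\u00a0"    # NO-BREAK SPACE
NNBSP = "\u202f"   # NARROW NO-BREAK SPACE (thin)


def french_spacing(text: str) -> str:
    """Replace ordinary spaces next to French punctuation with non-breaking ones.

    Illustrative helper only; it assumes a single U+0020 at each position.
    """
    # U+00A0 before a colon
    text = re.sub(r" :", NBSP + ":", text)
    # U+202F (thin) before ; ! ?
    text = re.sub(r" ([;!?])", NNBSP + r"\1", text)
    # U+00A0 inside the quotes: after the opening one, before the closing one
    text = text.replace("« ", "«" + NBSP)
    text = text.replace(" »", NBSP + "»")
    return text
```

A PDF typeset from text spaced this way is the kind of input where the
non-breaking spaces would be expected to survive extraction.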

When I run:

pdftotext -enc UTF-8  file.pdf file.txt

on a Unicode-encoded PDF file which holds such sequences, the output
file shows all high punctuation characters preceded by the same
breakable U+20 space, which looks wrong to me.
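
One way to check which space characters actually end up in the output is
to count them. The sketch below is only a suggestion: the names
count_spaces and count_spaces_in_file are illustrative, and the output
file is assumed to be UTF-8 encoded, as requested with "-enc UTF-8".

```python
# The three space characters discussed in this report.
SPACES = {
    "\u0020": "U+0020",  # ordinary (breakable) space
    "\u00a0": "U+00A0",  # no-break space
    "\u202f": "U+202F",  # narrow no-break space
}


def count_spaces(text: str) -> dict:
    """Return how often each of the three space characters occurs in text."""
    counts = {label: 0 for label in SPACES.values()}
    for ch in text:
        if ch in SPACES:
            counts[SPACES[ch]] += 1
    return counts


def count_spaces_in_file(path: str) -> dict:
    """Count the space characters in a UTF-8 text file (assumed encoding)."""
    with open(path, encoding="utf-8") as handle:
        return count_spaces(handle.read())
```

Calling count_spaces_in_file("file.txt") on the pdftotext output gives
the three counts; if the non-breaking spaces had been preserved, the
U+00A0 and U+202F counts would be non-zero.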

I am using version 0.48 included in Debian Stretch.

I attach a simple test file "spaces.pdf" (FYI, it was produced by LuaTeX)
and "spaces.txt", the output of "pdftotext -enc UTF-8".

Cheers,
-- 
Daniel Flipo

spaces.pdf
Description: Adobe PDF document

a : b ; c ! d ! « x ».

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
