Hi all,

Even when pdftotext is run with the option "-enc UTF-8", it converts all non-breaking spaces, U+00A0 and U+202F, into U+0020 (breakable). I wonder whether this behaviour is intended or not.

In French, high punctuation characters (:;!?) should be preceded by non-breaking spaces; the Unicode characters U+00A0 for `:' and U+202F (thin) for the three others are perfect for this purpose. The French quote characters `«' and `»' also need a U+00A0 non-breaking space.

When I run:

pdftotext -enc UTF-8 file.pdf file.txt

on a Unicode-encoded PDF file which holds such sequences, the output file shows all high punctuation characters preceded by the same breakable U+0020 space, which looks wrong to me.

I am using version 0.48, included in Debian Stretch. I attach a simple test file "spaces.pdf" (FYI, it was produced by LuaTeX) and "spaces.txt", the output of "pdftotext -enc UTF-8".

Cheers,
-- 
Daniel Flipo
spaces.pdf
Description: Adobe PDF document
a : b ; c ! d ! « x ».
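
For anyone who wants to check which space characters actually end up in the extracted text, here is a minimal sketch in Python. It is not part of poppler; the helper name `count_spaces` and the `expected` string are assumptions made for illustration, with `expected` built from the spacing rules described in the message (U+00A0 before `:' and inside the guillemets, U+202F before `;' and `!'). The `observed` string simulates the reported behaviour by replacing both non-breaking spaces with U+0020.

```python
from collections import Counter
import unicodedata

# The three space characters discussed in the message.
SPACE_CHARS = ("\u0020", "\u00a0", "\u202f")


def count_spaces(text):
    """Count each of the three space characters, keyed by code point and name."""
    counts = Counter(ch for ch in text if ch in SPACE_CHARS)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": counts[ch]
        for ch in SPACE_CHARS
    }


# Assumed content of the test line, following the French spacing rules
# described in the message (hypothetical reconstruction, not read from the PDF).
expected = "a\u00a0: b\u202f; c\u202f! d\u202f! «\u00a0x\u00a0»."

# Simulation of the reported output: every non-breaking space becomes U+0020.
observed = expected.replace("\u00a0", " ").replace("\u202f", " ")

if __name__ == "__main__":
    for label, text in (("expected", expected), ("observed", observed)):
        print(label)
        for name, n in count_spaces(text).items():
            print(f"  {name}: {n}")
```

To inspect a real file, the same function could be applied to the contents of "spaces.txt" read with `open("spaces.txt", encoding="utf-8").read()`.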
