Hello,
I am dealing with PDF files that have been created using TeX. This
seems to create some specific problems.
These are older papers from the 1990s; newer ones may be more
standardised and present fewer problems.
1. German umlauts may or may not be recognised.
For "Hölder" I get "Ho¨lder" in one place and "H¨older" in another,
within the same document. "Ho¨lder" would be correct in UTF-8 if the
diaeresis were the combining character (U+0308), but it is the
non-combining variety (U+00A8). The same appears in the HTML version
(here: ¨). The non-combining character is not a real problem, but
putting it before the vowel in one place and after it in another is.
PDFBox 0.7.3 seems to use the form "H¨older" consistently.
2. Some ligatures are lost: I get "de nition" for "definition" (the
"fi" ligature is replaced with a space instead of the letters "fi").
The same holds, for example, for all the words in "fix first satisfies
defined finite" and many others. On the other hand, the "fl" ligature
in "reflecting" is correctly resolved.
Is there any chance this can be fixed?
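As a stopgap for the first problem, the extracted text could be
post-processed. Below is a minimal sketch in Python, not a proper fix in
the extractor itself. The function name repair_umlauts and the
restriction to the vowels a, o, u are my own assumptions for
illustration, and the "before the vowel" case is tried first, which is
only a heuristic when the diaeresis sits between two vowels.

```python
import re
import unicodedata

DIAERESIS = "\u00a8"            # spacing (non-combining) diaeresis from the extracted text
COMBINING_DIAERESIS = "\u0308"  # combining diaeresis
VOWELS = "aouAOU"               # assumption: only German umlauts matter here


def compose(vowel):
    """Return the precomposed umlaut for a plain vowel, e.g. 'o' -> 'ö'."""
    return unicodedata.normalize("NFC", vowel + COMBINING_DIAERESIS)


def repair_umlauts(text):
    """Hypothetical helper: merge a stray U+00A8 with the neighbouring vowel."""
    # Case "H¨older": the diaeresis was emitted before the vowel.
    text = re.sub(DIAERESIS + "([" + VOWELS + "])",
                  lambda m: compose(m.group(1)), text)
    # Case "Ho¨lder": the diaeresis was emitted after the vowel.
    text = re.sub("([" + VOWELS + "])" + DIAERESIS,
                  lambda m: compose(m.group(1)), text)
    return text
```

With this, both repair_umlauts("H¨older") and repair_umlauts("Ho¨lder")
give "Hölder". It does nothing for the second problem: once the "fi"
ligature has been replaced with a space, the information is gone from
the text and would have to be recovered during extraction.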
All the best
Thomas