Older versions substituted a different font (they couldn't render Type 1
fonts), while 1.8.5 tries to render the embedded fonts with AWT, and
this does not always go well.
Anyway, see
https://issues.apache.org/jira/browse/PDFBOX-2000
https://issues.apache.org/jira/browse/PDFBOX-1019
and try to use the 2.0 version.
Tilman
On 02.05.2014 07:49, Anupama Krishnan wrote:
Hello,
I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
http://www.greenstone.org/docs/greenstone3/manual.pdf
It removed the fl and fi prefixes from words like "flexible", "file"
and "first". Perhaps these genuine word prefixes have been confused
with f-based ligatures?
We were previously using a pdfbox-app 1.5.* version and wanted to
switch over to a newer one. Version 1.8.2 does not have this issue.
The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />" org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
Relevant excerpts from the output generated:
- "improve exibility, modularity, and extensibility"
the 2nd word should be "flexibility"
- "Table 1 shows the le hierarchy for Greenstone3. The rst part shows
the common"
The words "file" and "first" have been truncated to "le" and "rst"
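To illustrate the ligature hypothesis: if the PDF maps its fi and fl glyphs to the Unicode ligature code points U+FB01 and U+FB02 (an assumption, not verified against manual.pdf), a correct extraction would contain those characters rather than dropping them, and they could then be expanded to plain letters. The sketch below uses only the JDK, not any PDFBox API; the class and method names are hypothetical, and the sample string is constructed for illustration, not taken from real output.

```java
import java.text.Normalizer;

public class LigatureCheck {
    // Expands compatibility characters, including the Latin ligatures
    // U+FB00..U+FB06, into their plain-letter equivalents using NFKC.
    // Note: NFKC also rewrites other compatibility characters.
    public static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Returns true if the text contains a code point in the
    // Latin ligature range U+FB00..U+FB06.
    public static boolean containsLigature(String text) {
        return text.chars().anyMatch(c -> c >= 0xFB00 && c <= 0xFB06);
    }

    public static void main(String[] args) {
        // Hypothetical extraction result with the ligatures preserved.
        String extracted =
            "improve \uFB02exibility, the \uFB01le hierarchy, the \uFB01rst part";
        System.out.println(containsLigature(extracted));
        System.out.println(expandLigatures(extracted));
    }
}
```

If the extracted text contains neither the ligature characters nor the plain letters, as in "exibility" and "le", the glyphs were lost during extraction and no post-processing of this kind can recover them.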
I believe this is a bug rather than intended behaviour.
Kind regards,
Anupama