Hello,
I ran pdfbox-app version 1.8.5 over the Greenstone manual PDF:
http://www.greenstone.org/docs/greenstone3/manual.pdf
It removed the fl and fi prefixes from words like "flexible", "file" and
"first". Perhaps these genuine word prefixes have been confused with
f-based ligatures?
We were previously using a pdfbox-app 1.5.* version and wanted to switch
over to a newer one. Version 1.8.2 does not have this issue.
The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
Relevant excerpts from the output generated:
- "improve exibility, modularity, and extensibility"
the 2nd word should be "flexibility"
- "Table 1 shows the le hierarchy for Greenstone3. The rst part shows
the common"
The words "file" and "first" have been truncated to "le" and "rst"
I believe this is a bug rather than intended behaviour.
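To illustrate the ligature hypothesis, here is a minimal sketch of the behaviour I would expect. It is not PDFBox code, and it assumes (I have not verified this) that the PDF encodes these glyphs as the Unicode ligature characters U+FB01 (fi) and U+FB02 (fl). The class name LigatureCheck and the sample string are made up for illustration; the only library call is the standard java.text.Normalizer, which expands the ligatures into plain letters instead of dropping them.

```java
import java.text.Normalizer;

public class LigatureCheck {
    // Expands compatibility characters such as U+FB01 (fi) and U+FB02 (fl)
    // into their plain-letter equivalents using NFKC normalization.
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical extracted text that still contains ligature characters.
        String extracted = "\uFB02exibility, \uFB01le, \uFB01rst";
        System.out.println(expandLigatures(extracted));
        // prints "flexibility, file, first"
    }
}
```

If the extracted text instead has the ligature characters removed altogether, as in the excerpts above, that would suggest they are being lost somewhere before or during this kind of normalization.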
Kind regards,
Anupama