Hi,
On 02.05.2014 07:49, Anupama Krishnan wrote:
Hello,
I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
http://www.greenstone.org/docs/greenstone3/manual.pdf
It removed the fl and fi prefixes from words like "flexible", "file" and
"first". Perhaps these genuine word prefixes have been confused with f-based
ligatures?
We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
a newer one. Version 1.8.2 does not have this issue.
The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
Relevant excerpts from the output generated:
- "improve exibility, modularity, and extensibility"
the 2nd word should be "flexibility"
- "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the
common"
The words "file" and "first" have been truncated to "le" and "rst"
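To illustrate the suspected cause, here is a minimal, self-contained sketch. It assumes the PDF encodes these letter pairs as the Unicode ligature characters U+FB01 (fi) and U+FB02 (fl), which I have not verified for this document. It is not PDFBox code; the class and method names are made up for the example. It only shows that dropping those characters yields exactly the truncated words above, while expanding them with the JDK's java.text.Normalizer yields the expected words.

```java
import java.text.Normalizer;

public class LigatureDemo {
    // Expand compatibility characters such as U+FB01 (fi) and U+FB02 (fl)
    // into their plain-letter equivalents using NFKC normalization.
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Hypothetical failure mode (an assumption, not PDFBox's actual logic):
    // the Latin ligature characters U+FB00..U+FB06 are dropped entirely.
    static String dropLigatures(String text) {
        return text.replaceAll("[\uFB00-\uFB06]", "");
    }

    public static void main(String[] args) {
        // "flexibility", "file" and "first" written with ligature characters.
        String[] words = {"\uFB02exibility", "\uFB01le", "\uFB01rst"};
        for (String w : words) {
            System.out.println(dropLigatures(w) + " vs " + expand(w));
        }
    }
}
```

Run as written, this prints "exibility vs flexibility", "le vs file" and "rst vs first", which matches the symptom in the excerpts.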
I believe this is a bug rather than intended behaviour.
Yes, I can reproduce that behaviour and created an issue [1] on JIRA.
Kind regards,
Anupama
Thanks for the report.
BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-2058