Older versions substituted another font because they could not render Type 1 fonts, whereas 1.8.5 tries to render the embedded fonts with AWT, which does not always go well.

Anyway, see

https://issues.apache.org/jira/browse/PDFBOX-2000
https://issues.apache.org/jira/browse/PDFBOX-1019

and try to use the 2.0 version.

Tilman



On 02.05.2014 07:49, Anupama Krishnan wrote:
Hello,

I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual: http://www.greenstone.org/docs/greenstone3/manual.pdf

It removed the fl and fi prefixes from words like "flexible", "file" and "first". Perhaps these genuine word prefixes have been confused with f-based ligatures?

We were previously using a pdfbox-app 1.5.* version and wanted to switch over to a newer one. Version 1.8.2 does not have this issue.


The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />" org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"

Relevant excerpts from the output generated:
- "improve exibility, modularity, and extensibility"
The 2nd word should be "flexibility"
- "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
The words "file" and "first" have been truncated to "le" and "rst"
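
For reference, Unicode has dedicated code points for the two ligatures in question: U+FB01 (fi) and U+FB02 (fl). The following is only a minimal sketch, assuming an extractor emitted those code points instead of dropping the glyphs as happened above. It uses the standard java.text.Normalizer class and not any PDFBox API; the class name LigatureCheck and the sample string are hypothetical.

```java
import java.text.Normalizer;

class LigatureCheck {
    // Expands compatibility characters such as U+FB01 (fi ligature) and
    // U+FB02 (fl ligature) into plain letters via NFKC normalization.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical extractor output that still contains ligature code points.
        String extracted = "\uFB02exibility, \uFB01le, \uFB01rst";
        System.out.println(expandLigatures(extracted));
    }
}
```

If the output of ExtractText contained such code points, a step like this would restore "flexibility", "file" and "first". In the excerpts above the characters are missing altogether, so this cannot repair the 1.8.5 output; it only illustrates which characters appear to have been lost.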

I believe this is a bug rather than intended behaviour.

Kind regards,
Anupama
