Hi. I've been using PDFBox 0.7.3 for text extraction and indexing with
Lucene for some time now, and I found that for some of our PDF files, which
have complex layouts and "rare" fonts, the extracted text came out without
white space between words. This is caused by the factor used for
calculating character spacing in the
org.pdfbox.util.pdftextstripper.j...@flushtext method, lines 442 and 446.
The original factor is 0.50f, but I found that 0.30f worked better (in my
case).
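
For illustration, here is a minimal sketch of how a factor like this can
decide whether a space is emitted between two characters. It is not the
PDFBox source: the class SpacingFactorSketch, the Glyph type, the join
method and all the numbers are made up for the example, and it assumes a
space is inserted whenever the horizontal gap between two characters
exceeds the factor times the width of a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT PDFBox code: shows how a spacing factor can
// change where word breaks are detected.
final class SpacingFactorSketch {

    /** One positioned character: its text, left edge and advance width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates glyphs left to right, inserting a space when the gap
     * to the previous glyph is larger than spaceWidth * factor
     * (assumed rule, chosen for the example).
     */
    static String join(List<Glyph> glyphs, float spaceWidth, float factor) {
        StringBuilder out = new StringBuilder();
        float endOfLast = Float.NaN; // NaN means "no previous glyph yet"
        for (Glyph g : glyphs) {
            if (!Float.isNaN(endOfLast) && g.x - endOfLast > spaceWidth * factor) {
                out.append(' ');
            }
            out.append(g.text);
            endOfLast = g.x + g.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        float spaceWidth = 10f;
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("a", 0f, 5f));  // ends at x = 5
        glyphs.add(new Glyph("b", 5f, 5f));  // gap of 0, ends at x = 10
        glyphs.add(new Glyph("c", 14f, 5f)); // gap of 4, i.e. 0.4 * spaceWidth

        System.out.println(join(glyphs, spaceWidth, 0.50f)); // prints "abc"
        System.out.println(join(glyphs, spaceWidth, 0.30f)); // prints "ab c"
    }
}
```

With a tightly set font, the gap between words can be smaller than half a
space width, so under this assumed rule a factor of 0.50f misses the break
while 0.30f catches it. A lower factor also raises the risk of spurious
spaces inside words, so the best value probably depends on the documents.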

I'm posting this to let the PDFBox developers (and anyone else) know,
because I saw that the new version, 0.8.0, still uses the same factor. I
really don't know if this is the right place to post this; if it's not, I
apologize.

Looking forward to seeing the new version working. :)

Thanks,
____________________
Felipe C. Meirelles
