Hi. I've been using PDFBox 0.7.3 for text extraction and Lucene indexing for some time now, and I found that for some of our PDF files, which have complex layouts and "rare" fonts, the extracted text came out without white space between words. This was caused by the factor used to calculate character spacing in the org.pdfbox.util.pdftextstripper.j...@flushtext method, lines 442 and 446. The original factor is 0.50f, but I found that 0.30f worked better (in my case).
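To illustrate why the factor matters, here is a minimal sketch of this kind of heuristic. It is not the actual PDFBox code: the class, the Glyph type, and the choice of comparing the gap against the previous glyph's width are all assumptions made up for the example (the real method may use a different reference width). The idea is only that a space is emitted when the horizontal gap between two glyphs exceeds factor * some reference width, so a smaller factor detects narrower word gaps.

```java
import java.util.Arrays;
import java.util.List;

public class SpacingFactorSketch {

    /** Hypothetical glyph record: a character, its left x position, and its width. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into text, inserting a space whenever the gap between two
     * consecutive glyphs exceeds factor * (width of the previous glyph).
     * Using the previous glyph's width as the reference is an assumption.
     */
    static String join(List<Glyph> glyphs, float factor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > factor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: the gap between 'b' and 'c' is 4 units,
        // i.e. 0.4 times the glyph width of 10.
        List<Glyph> glyphs = Arrays.asList(
            new Glyph('a', 0f, 10f),
            new Glyph('b', 10f, 10f),
            new Glyph('c', 24f, 10f),
            new Glyph('d', 34f, 10f));

        System.out.println(join(glyphs, 0.50f)); // gap 4 is not above 5: no space
        System.out.println(join(glyphs, 0.30f)); // gap 4 is above 3: space inserted
    }
}
```

With these made-up numbers, the 0.50f factor joins everything into one word, while 0.30f splits it into two. The trade-off is that a factor that is too small may start inserting spaces inside words with loose letter spacing, which is probably why the best value depends on the documents.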
I'm posting this so that the PDFBox developers (and anyone else) are aware of it, especially since I saw that the new version, 0.8.0, uses the same factor. I really don't know whether this is the best place to post this; if it isn't, I apologize. Looking forward to seeing the new version working. :)

Thanks,
Felipe C. Meirelles
