Hello there,

> I have a PDF file with 1 page only, when I try to extract its text using :
> String pageData = stripper.getText( pdfFile );

I used the standard org.apache.pdfbox.ExtractText command-line application to verify your issue.

> It ignores some Enter characters between lines, so the last word in the line
> and the first word in the next line appear as 1 word without spaces between
> them !!
>
> You can download the PDF file from here to try it :
> http://www.4shared.com/file/185259485/5d937eb/Enters-sample.html

The problem arises from the styling of the first letter of the paragraph ("P"). It is painted in green and uses a font size eight times larger than that of all the remaining characters (88 vs. 11). The current line detection algorithm is not prepared to handle such edge cases. In other words, since the letter "P" overlaps vertically with all the remaining characters, they are all considered to belong to a single line, which is why no newline characters (i.e. "Enters") are emitted.

> Is there a way to fix this ?

If you're good at Java programming, you might look into org.apache.pdfbox.util.PDFTextStripper and see if you can improve the line detection algorithm to take horizontal overlaps into consideration in addition to vertical overlaps.

Alternatively, you might try to override parts of org.apache.pdfbox.util.PDFStreamEngine with the intent of "correcting" the font size of suspicious TextPosition instances (e.g. the ones which are painted in green and whose height is at least three times that of their previous/next sibling).

Otherwise, you might file an issue in the PDFBox issue tracker and hope that someone does it for you.

VR
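To make the second suggestion a bit more concrete, here is a minimal, self-contained Java sketch. It deliberately does not use the PDFBox API: Glyph is a hypothetical, simplified stand-in for TextPosition, and extract, clampDropCaps and samplePage are names invented for this illustration. The line detection is only a naive approximation of the vertical-overlap idea described above, not the actual PDFTextStripper algorithm, and the color check is left out. The coordinates assume a y axis growing downwards and a drop cap whose top is aligned with the first line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: none of these types or methods exist in PDFBox.
class DropCapLineSketch {

    // Hypothetical stand-in for a TextPosition: a text chunk with a vertical extent.
    static final class Glyph {
        final String text;
        final double top;    // y of the top edge, growing downwards
        final double height; // glyph height, here equal to the font size

        Glyph(String text, double top, double height) {
            this.text = text;
            this.top = top;
            this.height = height;
        }

        double bottom() {
            return top + height;
        }
    }

    // Naive line detection: a glyph stays on the current line as long as it
    // overlaps vertically with the extent accumulated for that line.
    static String extract(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        double lineTop = 0;
        double lineBottom = 0;
        boolean first = true;
        for (Glyph g : glyphs) {
            if (first) {
                lineTop = g.top;
                lineBottom = g.bottom();
                first = false;
            } else if (g.top < lineBottom && g.bottom() > lineTop) {
                // Vertical overlap: same line, so grow the line's extent.
                lineTop = Math.min(lineTop, g.top);
                lineBottom = Math.max(lineBottom, g.bottom());
            } else {
                // No overlap: start a new line.
                out.append('\n');
                lineTop = g.top;
                lineBottom = g.bottom();
            }
            out.append(g.text);
        }
        return out.toString();
    }

    // "Corrects" suspicious glyphs: if a glyph is at least three times taller
    // than its shortest direct neighbour, its height is clamped to that neighbour's.
    static List<Glyph> clampDropCaps(List<Glyph> glyphs) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < glyphs.size(); i++) {
            Glyph g = glyphs.get(i);
            double reference = Double.MAX_VALUE;
            if (i > 0) {
                reference = Math.min(reference, glyphs.get(i - 1).height);
            }
            if (i < glyphs.size() - 1) {
                reference = Math.min(reference, glyphs.get(i + 1).height);
            }
            boolean suspicious = reference != Double.MAX_VALUE && g.height >= 3 * reference;
            out.add(suspicious ? new Glyph(g.text, g.top, reference) : g);
        }
        return out;
    }

    // Made-up page: an 88pt drop cap "P" followed by three lines of 11pt text.
    static List<Glyph> samplePage() {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("P", 0, 88));
        page.add(new Glyph("DFBox extracts text", 0, 11));
        page.add(new Glyph("from this page", 13, 11));
        page.add(new Glyph("line by line", 26, 11));
        return page;
    }

    public static void main(String[] args) {
        List<Glyph> page = samplePage();
        System.out.println(extract(page));                // lines run together
        System.out.println(extract(clampDropCaps(page))); // line breaks restored
    }
}
```

With the made-up sample page, the unmodified input yields one run-together line, which matches the symptom you reported, while the clamped input yields three separate lines. In a real fix the same check would have to be applied to the TextPosition instances before PDFTextStripper groups them into lines, and the factor of three is just the heuristic mentioned above, so expect to tune it.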

