Thanks a lot for your reply, Villu.
In other words, since the letter "P" overlaps vertically with all the
remaining characters
I also see the problem in the bottom lines of the same page, which are far
away from the letter "P". I have also noticed that the problem happens in PDF
pages that are divided into two parts.
Otherwise, you might file an issue into PDFBox issue tracker and hope
that someone does it for you.
I think I will do that, and I hope someone in a couple of years will fix
this :)
Best regards,
Hesham
--------------------------------------------------
Hello there,
I have a PDF file with only one page. I try to extract its text using:
String pageData = stripper.getText( pdfFile );
I used the standard org.apache.pdfbox.ExtractText command-line
application to verify your issue.
It ignores some Enter characters (line breaks) between lines, so the last
word of one line and the first word of the next line appear as a single word
with no space between them!
You can download the PDF file from here to try it:
http://www.4shared.com/file/185259485/5d937eb/Enters-sample.html
The problem arises from the styling of the first letter of the
paragraph ("P"). Currently it is painted in green, and it uses a font
size which is eight times that of all the remaining
characters (88 vs. 11).
The current line detection algorithm is not prepared to handle such
edge cases. In other words, since the letter "P" overlaps vertically
with all the remaining characters, they are all considered to
represent a single line, which is why no newline characters (i.e.
"Enters") are emitted.
Is there a way to fix this?
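
To make the explanation above concrete, here is a minimal sketch of how a
line detector based purely on vertical overlap merges rows once a tall
initial is present. This is a toy model, not PDFBox's actual code: the Box
class, the countLines method and the 13-unit row spacing are all assumptions
made up for illustration. Only the 88 vs. 11 sizes come from the sample file.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of line detection by vertical overlap; NOT PDFBox's real algorithm. */
final class VerticalOverlapDemo {

    /** Hypothetical glyph box: top and bottom edges, with y growing downwards. */
    static final class Box {
        final double top;
        final double bottom;

        Box(double top, double bottom) {
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** True when the two boxes share any vertical extent. */
    static boolean overlaps(Box a, Box b) {
        return a.top < b.bottom && b.top < a.bottom;
    }

    /** Counts lines; a new line starts only when a glyph does not overlap the current line. */
    static int countLines(List<Box> glyphs) {
        int lines = 0;
        Box line = null;
        for (Box g : glyphs) {
            if (line == null || !overlaps(line, g)) {
                lines++;
                line = g;
            } else {
                // Grow the current line so that it covers the new glyph as well.
                line = new Box(Math.min(line.top, g.top), Math.max(line.bottom, g.bottom));
            }
        }
        return lines;
    }

    /** Builds rows of 11-unit text spaced 13 units apart, optionally led by an 88-unit initial. */
    static List<Box> sample(boolean withInitial, int rows) {
        List<Box> glyphs = new ArrayList<>();
        if (withInitial) {
            glyphs.add(new Box(0, 88));
        }
        for (int i = 0; i < rows; i++) {
            glyphs.add(new Box(i * 13.0, i * 13.0 + 11));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(countLines(sample(false, 6))); // 6
        System.out.println(countLines(sample(true, 6)));  // 1
    }
}
```

In this model the six rows are kept apart without the initial, but collapse
into one line as soon as the 88-unit box is the first glyph, because every
row overlaps it vertically.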
If you're good at Java programming you might look into
org.apache.pdfbox.util.PDFTextStripper and see if you can improve the
line detection algorithm to take into consideration horizontal
overlaps in addition to vertical overlaps. Alternatively, you might
try to override parts of org.apache.pdfbox.util.PDFStreamEngine with
the intent of "correcting" the font size of suspicious TextPosition
instances (e.g. the ones which are painted in green and whose height is
at least three times that of their previous/next sibling).
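
The "correcting" idea could look roughly like the sketch below. It works on
a plain array of glyph heights rather than on real TextPosition objects, and
the class and method names (OversizedGlyphFilter, normalize) are
hypothetical. How to wire such a check into PDFTextStripper or
PDFStreamEngine is left open here, and the colour test is omitted.

```java
/** Sketch of the "suspicious height" heuristic on plain numbers; not tied to the PDFBox API. */
final class OversizedGlyphFilter {

    /**
     * Returns a copy of the heights in which every glyph that is at least
     * `factor` times taller than its neighbour is reduced to the neighbour's
     * height. The neighbour is the next glyph, or the previous one for the
     * last glyph in the sequence.
     */
    static double[] normalize(double[] heights, double factor) {
        double[] out = heights.clone();
        for (int i = 0; i < heights.length; i++) {
            double neighbour;
            if (i + 1 < heights.length) {
                neighbour = heights[i + 1];
            } else if (i > 0) {
                neighbour = heights[i - 1];
            } else {
                neighbour = heights[i]; // single glyph: nothing to compare against
            }
            if (neighbour > 0 && heights[i] >= factor * neighbour) {
                out[i] = neighbour;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fixed = normalize(new double[] {88, 11, 11, 11}, 3.0);
        System.out.println(java.util.Arrays.toString(fixed)); // [11.0, 11.0, 11.0, 11.0]
    }
}
```

With the heights from the sample file (88 for the "P", 11 for everything
else) and a factor of three, only the initial is shrunk, so it would no
longer overlap the rows below it.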
Otherwise, you might file an issue into PDFBox issue tracker and hope
that someone does it for you.
VR