Re: Problem extracting text in Enter chars

Villu Ruusmann Sun, 03 Jan 2010 00:41:29 -0800

Hello there,

>
> I have a PDF file with 1 page only, when I try to extract its text using :
> String pageData = stripper.getText( pdfFile );
>


I used the standard org.apache.pdfbox.ExtractText command-line
application to verify your issue.

> It ignores some Enter characters between lines, so the last word in the line 
> and the first word in the next line appear as 1 word without spaces between 
> them !!
>
> You can download the PDF file from here to try it :
> http://www.4shared.com/file/185259485/5d937eb/Enters-sample.html
>

The problem arises from the styling of the first letter of the
paragraph ("P"). It is painted in green, and its font size is eight
times that of all the remaining characters (88 vs. 11).

The current line detection algorithm is not prepared to handle such
edge cases. Since the letter "P" overlaps vertically with all the
remaining characters, they are all considered to belong to a single
line, which is why no newline characters (i.e. "Enters") are emitted.

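To illustrate the effect, here is a minimal sketch of a line detector
based on vertical overlap. It is a simplified model and not the actual
PDFTextStripper code: the Glyph and OverlapLineDetector classes, their
fields and the coordinates used are all assumptions made up for this
example.

```java
import java.util.List;

/** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
final class Glyph {
    final String text;
    final double top;     // distance from the top of the page to the top of the glyph
    final double height;  // glyph height, roughly the font size

    Glyph(String text, double top, double height) {
        this.text = text;
        this.top = top;
        this.height = height;
    }

    double bottom() {
        return top + height;
    }
}

final class OverlapLineDetector {
    /**
     * Concatenates glyph texts in order. A newline is emitted only when a
     * glyph does not overlap vertically with the extent of the current line.
     */
    static String extract(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        double lineTop = 0;
        double lineBottom = 0;
        boolean first = true;
        for (Glyph g : glyphs) {
            boolean overlaps = !first && g.top < lineBottom && g.bottom() > lineTop;
            if (overlaps) {
                // Same line: grow the line extent to cover this glyph.
                lineTop = Math.min(lineTop, g.top);
                lineBottom = Math.max(lineBottom, g.bottom());
            } else {
                // New line: emit a newline (except before the very first glyph).
                if (!first) {
                    out.append('\n');
                }
                lineTop = g.top;
                lineBottom = g.bottom();
            }
            out.append(g.text);
            first = false;
        }
        return out.toString();
    }
}
```

In this model, two 11 pt fragments at top = 0 and top = 13 end up on
separate lines. Put an 88 pt "P" at top = 0 in front of them and the
line extent becomes 0..88, so every following fragment overlaps with it
and no newline is emitted at all.
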
> Is there a way to fix this ?
>

If you're comfortable with Java programming, you might look into
org.apache.pdfbox.util.PDFTextStripper and see if you can improve the
line detection algorithm to take into consideration horizontal
overlaps in addition to vertical overlaps. Alternatively, you might
try to override parts of org.apache.pdfbox.util.PDFStreamEngine with
the intent of "correcting" the font size of suspicious TextPosition
instances (e.g. the ones which are painted in green and whose height is
at least three times that of their previous/next sibling).
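
The second idea could look roughly like the sketch below. It only shows
the size heuristic on a made-up SizedGlyph class; the colour check is
left out, and where exactly to hook this into PDFBox is not shown. The
class names and the threshold constant are assumptions for this
example, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a glyph with a height; not PDFBox's TextPosition. */
final class SizedGlyph {
    final String text;
    final double height;

    SizedGlyph(String text, double height) {
        this.text = text;
        this.height = height;
    }
}

final class DropCapNormalizer {
    /** A glyph at least this many times taller than its sibling is treated as suspicious. */
    static final double RATIO = 3.0;

    /**
     * Returns a copy of the list in which every suspicious glyph has its
     * height replaced by that of its sibling. The next sibling is used as
     * the reference, falling back to the previous one for the last glyph.
     */
    static List<SizedGlyph> normalize(List<SizedGlyph> glyphs) {
        List<SizedGlyph> result = new ArrayList<>(glyphs.size());
        for (int i = 0; i < glyphs.size(); i++) {
            SizedGlyph g = glyphs.get(i);
            SizedGlyph sibling = null;
            if (i + 1 < glyphs.size()) {
                sibling = glyphs.get(i + 1);
            } else if (i > 0) {
                sibling = glyphs.get(i - 1);
            }
            boolean suspicious = sibling != null
                    && sibling.height > 0
                    && g.height >= RATIO * sibling.height;
            result.add(suspicious ? new SizedGlyph(g.text, sibling.height) : g);
        }
        return result;
    }
}
```

With the heights from your file, the "P" (88) followed by ordinary
characters (11) would be reduced to 11, after which it no longer spans
several lines of text.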

Otherwise, you might file an issue in the PDFBox issue tracker and hope
that someone does it for you.


VR
