[
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976922#action_12976922
]
Mel Martinez commented on PDFBOX-588:
-------------------------------------
There are two separate problems discussed here.
One is line separation detection. That seems to be reasonably well addressed
in PDFBOX-521.
The other problem that Hesham refers to is the dropping of space characters.
This is a different bit of logic in PDFBox from line/paragraph/article/page
demarcation. There is code that looks at the positional gap between two
characters and decides whether to insert a space character or not. The
accuracy of this looks to be somewhat dependent on the fonts used.
I am actually seeing the inverse trend. On some of our test documents we used
to see this problem with PDFBox v1.0, but it has been fixed in the later
code, at least for these particular documents.
It is possible that the problem has just shifted, so that different fonts
now trigger it.
You can try calling 'PDFTextStripper.setSpacingTolerance(float)' to change
the behavior. The default value is 0.5f. Try making this larger or smaller
and see how it behaves for you. A smaller value should increase the
insertion of spaces.
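To make the effect of the tolerance concrete, here is a minimal, self-contained
sketch of a gap-based space heuristic. It is a simplified model and not PDFBox's
actual code: the class SpacingSketch, the Glyph type, the join method and the
rule "insert a space when the gap exceeds spaceWidth * tolerance" are all
assumptions made for illustration. It only shows why a smaller tolerance tends
to produce more spaces.

```java
import java.util.Arrays;
import java.util.List;

public class SpacingSketch {

    /** Hypothetical glyph record: its text, start x position and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs left to right, inserting a space whenever the horizontal
     * gap between two neighbours exceeds spaceWidth * tolerance.
     * (Assumed rule for illustration only.)
     */
    static String join(List<Glyph> glyphs, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Four glyphs, each 5 units wide, with a 1.5 unit gap between "b" and "c".
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("a", 0f, 5f),
                new Glyph("b", 5f, 5f),
                new Glyph("c", 11.5f, 5f),
                new Glyph("d", 16.5f, 5f));

        // Space width 4: tolerance 0.5 gives a threshold of 2.0, so no space.
        System.out.println(join(glyphs, 4f, 0.5f));   // abcd
        // Tolerance 0.25 gives a threshold of 1.0, so the 1.5 gap becomes a space.
        System.out.println(join(glyphs, 4f, 0.25f));  // ab cd
    }
}
```

In this model the width of a space depends on the font, which is one plausible
reason the same tolerance can work for one document and fail for another.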
> Problem extracting text in newline characters
> ---------------------------------------------
>
> Key: PDFBOX-588
> URL: https://issues.apache.org/jira/browse/PDFBOX-588
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Environment: Win XP
> Reporter: Hesham
> Attachments: Enters-sample.pdf, PDFTextStripper.patch
>
>
> Hello ,
>
> I have a PDF file with only one page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter characters between lines, so the last word of a line
> and the first word of the next line appear as one word, with no space between
> them.
> While if I copy the PDF text manually from the PDF and paste it in a text
> editor, Enter characters appear after the same lines that caused the problem
> in PDFBox.
> Please check the attached file as a sample.
>
> Is there a way to fix this ?
>
> Best regards ,